@Channelchan 2017-03-08T11:55:40.000000Z

StatsModels Descriptive Statistics

Distribution

Descriptive Statistics

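Descriptive statistics are methods for summarizing and describing the overall state of a dataset, as well as the associations and category relationships among its items. After this processing, a handful of summary values can concisely express a dataset's central tendency and its dispersion (how much it fluctuates).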

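Type	Description	Example	Result
Arithmetic Mean	Sum of the data divided by the number of data points	(1+2+2+3+4+7+9) / 7	4
Median	The middle value, which splits the sorted data into a lower and an upper half	1, 2, 2, 3, 4, 7, 9	3
Mode	The value that occurs most often	1, 2, 2, 3, 4, 7, 9	2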
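
As a quick check on the table above, the three measures can be reproduced with Python's standard `statistics` module. This is only a minimal sketch; the variable names are illustrative and do not come from the original.

```python
import statistics

values = [1, 2, 2, 3, 4, 7, 9]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)  # 4 3 2
```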

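Percentile:
Example: the 5th percentile is the value at which the cumulative frequency of the measurements reaches 5%. Taking returns as an example, the 5th percentile of the return distribution is the value that 5% of the returns fall below and 95% of the returns exceed.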
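
The percentile definition above can be illustrated with `np.percentile`. The sketch below uses made-up data (the integers 1 to 100), not real returns, and relies on NumPy's default linear interpolation between data points.

```python
import numpy as np

# Hypothetical data: the integers 1..100 stand in for 100 measurements
measurements = np.arange(1, 101)

p5 = np.percentile(measurements, 5)       # 5th percentile (linear interpolation by default)
share_below = np.mean(measurements < p5)  # fraction of measurements below that value

print(p5, share_below)
```

With this data, 5 of the 100 measurements lie below the 5th percentile, which matches the "5% below, 95% above" reading.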

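Variance and standard deviation:
The variance is the average of the squared deviations of each data point from the mean (the sample version below divides by n-1 rather than n); the standard deviation is the square root of the variance.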

$\sigma^2=\frac{1}{n-1}\sum^n_{i=1}(|x_i-\bar{x}|)^2$

$\sigma=\sqrt{\frac{1}{n-1}\sum^n_{i=1}(|x_i-\bar{x}|)^2}$
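
The two formulas above translate directly into NumPy. The sketch below reuses the small dataset from the table (an illustrative choice) and compares the hand-written version with NumPy's built-ins.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 4, 7, 9], dtype=float)
n = len(values)

# Direct translation of the formulas: squared deviations from the mean, divided by n - 1
variance_manual = np.sum((values - values.mean()) ** 2) / (n - 1)
std_manual = np.sqrt(variance_manual)

# NumPy divides by n unless ddof=1 is passed
variance_numpy = np.var(values, ddof=1)
std_numpy = np.std(values, ddof=1)

print(variance_manual, std_manual)
```

One point to keep in mind: `np.std` defaults to `ddof=0` (divide by n), while pandas' `Series.std()` defaults to `ddof=1`, so the two give slightly different numbers on the same data.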

Using the returns of stock 000001 as an example:

import tushare as ts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = ts.get_k_data('000001', start='2016-01-01', end='2016-12-31', ktype='D', autype='qfq')
data.index = pd.to_datetime(data['date'], format='%Y-%m-%d')
data['returns'] = data['close'].pct_change()
r = data.returns.dropna()
mean = r.mean()
median = r.median()
mode = r.mode()
std = r.std()
print('mean', mean)
print('median', median)
print('mode', mode)
print('std', std)

('mean', -1.1195727976130874e-05)
('median', 0.0009397392223660095)
('mode', 0 0.0
dtype: float64)
('std', 0.011625261186742808)

Random Variables

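Discrete Random Variables
Roll a die 100 times and let k be the number of rolls that show a 6. k is a random variable that can only take the integer values 0, 1, 2, ..., 100 and never a fractional value such as 3.5, so k is a discrete random variable.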

# np.random.randint(low, high, numberOfSamples); high is exclusive, so 7 yields the faces 1-6
# (replaces the deprecated np.random.random_integers)
DiscreteRandomVariable = np.random.randint(1, 7, 100)
plt.hist(DiscreteRandomVariable, bins = [1,2,3,4,5,6,7])
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.legend(['DiscreteRandomVariable'])
plt.show()
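
The histogram above shows the face values of the rolls. The count k from the description (how many of the 100 rolls show a 6) can be computed from the same kind of simulation. This is a minimal sketch; the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed is an arbitrary choice for reproducibility

rolls = rng.integers(1, 7, size=100)  # 100 die rolls; the upper bound 7 is exclusive
k = int(np.sum(rolls == 6))           # number of rolls that show a 6

print(k)
```

Whatever the seed, k is always a whole number between 0 and 100, which is what makes it a discrete random variable.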

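Continuous Random Variables
For example, a bus arrives every 15 minutes, and the waiting time x of each of 100 people at the stop is a random variable. x takes values in the interval [0, 15]; in theory it can be any real number in that interval, such as 3.5 or √20, so it is called a continuous random variable.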

# np.random.uniform(low, high, numberOfSamples)
CRV = np.random.uniform(0, 15, 100)
# x axis: index of each person; y axis: that person's waiting time
plt.plot(CRV)
plt.xlabel('Person')
plt.ylabel('Waiting time (minutes)')
plt.legend(['ContinuousRandomVariable'])
plt.show()
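
To see that the simulated waiting times behave like a continuous variable on [0, 15], one can draw a larger sample and compare its mean with the midpoint of the interval. The sample size and seed below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
waiting_times = rng.uniform(0, 15, size=100_000)  # many more samples than the 100 above

sample_mean = waiting_times.mean()
theoretical_mean = (0 + 15) / 2                   # midpoint of the interval

print(sample_mean, theoretical_mean)
```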

Distributions

Binomial Distribution
Toss a coin that lands heads with probability 0.5 five times, and repeat this for 50 rounds: how many heads come up in each round?

#np.random.binomial(numberOfTrials, probabilityOfSuccess, numberOfSamples)
BinomialRandomVariable = np.random.binomial(5, 0.50,50)
plt.hist(BinomialRandomVariable, bins = [0, 1, 2, 3, 4, 5, 6],align='left')
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.legend(['BinomialRandomVariable'])
plt.show()
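
The simulated counts can be compared with the theoretical binomial probabilities. The sketch below uses `scipy.stats.binom.pmf` for the theory and a seeded generator for the simulation; the seed and variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

n_trials, p_success = 5, 0.5
k_values = np.arange(0, n_trials + 1)

# Theoretical probability of seeing k heads in 5 tosses
pmf = stats.binom.pmf(k_values, n_trials, p_success)

# Empirical frequencies from 50 simulated rounds
rng = np.random.default_rng(seed=0)
samples = rng.binomial(n_trials, p_success, size=50)
empirical = np.bincount(samples, minlength=n_trials + 1) / len(samples)

for k, theo, emp in zip(k_values, pmf, empirical):
    print(k, round(theo, 4), emp)
```

With only 50 rounds the empirical frequencies will usually differ noticeably from the theoretical values; increasing `size` brings them closer.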

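Uniform Distribution
Example: drawing a spade from a deck of cards. The code below plots the PDF of a uniform distribution on [a, b] = [0, 13], which is constant at 1/(b-a):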

a = 0.0
b = 13.0
x = np.linspace(a, b, 14)
y = [1/(b-a) for i in x]
plt.plot(x, y)
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()
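
The same constant density can be obtained from `scipy.stats.uniform`, which is parameterized by `loc` (the lower bound) and `scale` (the width of the interval). This is a small cross-check rather than a replacement for the code above.

```python
import numpy as np
from scipy import stats

a, b = 0.0, 13.0
x = np.linspace(a, b, 14)

# scipy parameterizes the uniform distribution by loc (= a) and scale (= b - a)
density = stats.uniform.pdf(x, loc=a, scale=b - a)
total_probability = stats.uniform.cdf(b, loc=a, scale=b - a)

print(density[0], total_probability)
```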

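Normal Distribution
Normal curves with different standard deviations are generated in two ways, directly from the formula and with a library call, and plotted together for comparison.
Formula: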

$f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

from scipy import stats
mu_1 = 0
mu_2 = 0
sigma_1 = 1
sigma_2 = 2
x = np.linspace(-8, 8, 200)
y = (1/(sigma_1 * np.sqrt(2 * np.pi))) * np.exp(-(x-mu_1)*(x-mu_1) / (2 * sigma_1 * sigma_1))
z = stats.norm.pdf(x,mu_2,sigma_2)
plt.plot(x, y, x, z)
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()
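
The hand-written formula and `stats.norm.pdf` should agree for any choice of mean and standard deviation. The sketch below checks this numerically; `normal_pdf` is a helper name introduced here for illustration.

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    """Normal density written out from the formula above."""
    return 1.0 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

x = np.linspace(-8, 8, 200)
max_gap = np.max(np.abs(normal_pdf(x, 0, 2) - stats.norm.pdf(x, 0, 2)))

peak_sigma_1 = normal_pdf(0.0, 0, 1)   # height of the curve at the mean for sigma = 1
peak_sigma_2 = normal_pdf(0.0, 0, 2)   # half as high for sigma = 2

print(max_gap, peak_sigma_1, peak_sigma_2)
```

Doubling the standard deviation halves the peak height, which is why the sigma = 2 curve in the plot is flatter and wider.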

Randomly generate a given number of samples from normal distributions with these means and standard deviations.

numberOfSamples = 1000  # example sample size; the original leaves this value unspecified
y = np.random.normal(mu_1, sigma_1, numberOfSamples)
z = np.random.normal(mu_2, sigma_2, numberOfSamples)

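Testing a Distribution
Check whether the returns of stock 000001 follow a normal distribution, obtain the skewness and kurtosis of the returns, and finally plot a histogram of the returns.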

from statsmodels.stats.stattools import jarque_bera
prices = ts.get_k_data('000001', start='2016-01-01', end='2016-12-31', ktype='D',autype='qfq')
prices.index = pd.to_datetime(prices['date'],format='%Y-%m-%d')
# Take the daily returns
returns = prices['close'].pct_change()[1:]
#Set a cutoff
cutoff = 0.01
# Get the pvalue of the JB test
_, p_value, skewness, kurtosis = jarque_bera(returns)
print("The JB test pvalue is:", p_value)
print("We reject the hypothesis that the data are normally distributed", p_value < cutoff)
print("The skewness of the returns is:", skewness)
print("The kurtosis of the returns is:", kurtosis)
plt.hist(returns, bins = 20)
plt.xlabel('Value')
plt.ylabel('Occurrences')
plt.show()

The JB test pvalue is: 4.05367795768e-38
We reject the hypothesis that the data are normally distributed True
The skewness of the returns is: -0.589166660049
The kurtosis of the returns is: 6.95204544689
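
To make the test less of a black box, the Jarque-Bera statistic can be computed by hand from the sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K-3)^2/4). The sketch below does this on synthetic heavy-tailed data (not the real returns) and cross-checks against `scipy.stats.jarque_bera`, which is used here as a stand-in for the statsmodels function so that the example has fewer dependencies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
sample = rng.standard_t(df=5, size=2000) * 0.01   # synthetic heavy-tailed "returns", not real data

n = len(sample)
centered = sample - sample.mean()
m2 = np.mean(centered ** 2)
skewness = np.mean(centered ** 3) / m2 ** 1.5
kurtosis = np.mean(centered ** 4) / m2 ** 2       # equals 3 for a normal distribution

jb_manual = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
p_manual = stats.chi2.sf(jb_manual, df=2)         # JB is asymptotically chi-squared with 2 d.o.f.

jb_scipy, p_scipy = stats.jarque_bera(sample)
print(jb_manual, jb_scipy)
```

On this scale a normal distribution has kurtosis 3, so the value of about 6.95 reported above for the returns points to much heavier tails than a normal distribution would have.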

Take the sample mean and standard deviation of the returns and plot the corresponding normal curve over the histogram.

# Take the sample mean and standard deviation of the returns
sample_mean = np.mean(returns)
sample_std_dev = np.std(returns)
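x = np.linspace(sample_mean - 4 * sample_std_dev, sample_mean + 4 * sample_std_dev, len(returns))
# Normal PDF: the normalizing constant is 1 / (sigma * sqrt(2 * pi))
sample_distribution = ((1/(sample_std_dev * np.sqrt(2 * np.pi))) *
                       np.exp(-(x-sample_mean) * (x-sample_mean) / (2 * sample_std_dev * sample_std_dev)))
# density=True replaces the normed=True argument, which current matplotlib no longer accepts
plt.hist(returns, bins = 20, density = True)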
plt.plot(x, sample_distribution)
plt.xlabel('Value')
plt.ylabel('Probability')
plt.show()

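t Distribution
The probability density curve of the t distribution is unimodal, centered at 0 and symmetric about it. Its shape depends on the degrees of freedom: the fewer the degrees of freedom, the more spread out the distribution. The degrees of freedom in turn depend on the sample size and the number of variables. At 30 degrees of freedom the t distribution is already close to the standard normal curve, and at 120 it is practically indistinguishable from it (the two coincide exactly only in the limit of infinite degrees of freedom). Tables commonly list degrees of freedom 1-30, 40, 50, ..., 120. Compared with the standard normal distribution, the t distribution is leptokurtic with heavy tails, which matches the behavior of return distributions more closely, so it is used frequently in inferential statistics.
The following plot compares t distributions with different degrees of freedom.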

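x = np.arange(-4,4.004,0.004)
# t distribution PDFs for 5, 30 and 120 degrees of freedom
plt.plot(x,stats.t.pdf(x, 5),'r')
plt.plot(x,stats.t.pdf(x, 30))
plt.plot(x,stats.t.pdf(x, 120),'y')
plt.legend(['df = 5', 'df = 30', 'df = 120'])
plt.show()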
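
The convergence toward the normal curve can also be checked numerically instead of by eye. The sketch below measures the largest gap between each t density and the standard normal density on the same grid, and compares one tail probability; the threshold 3 is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import stats

x = np.arange(-4, 4.004, 0.004)
normal_density = stats.norm.pdf(x)

# Largest vertical gap between each t density and the standard normal density
max_gap = {df: np.max(np.abs(stats.t.pdf(x, df) - normal_density)) for df in (5, 30, 120)}

# Probability of a value beyond 3: the t distribution puts more mass in the tail
tail_t5 = stats.t.sf(3, 5)
tail_normal = stats.norm.sf(3)

print(max_gap)
print(tail_t5, tail_normal)
```

The gap shrinks as the degrees of freedom grow but stays above zero even at 120, and the t distribution with 5 degrees of freedom assigns noticeably more probability to the tail than the normal distribution does.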
