@Channelchan
2017-11-27T14:18:10.000000Z
字数 7914
阅读 138717
处理系列的许多方面延续到DataFrame。pandas的DataFrame让我们轻松地管理我们的数据和他们的直觉结构。
DataFrames可以容纳多种类型的数据,但DataFrames是二维的对象,不同于series。
字典转DataFrames的例子:
import numpy as np
import pandas as pd
dict_data = {
'a' : [1, 2, 3, 4, 5],
'b' : ['L', 'K', 'J', 'M', 'Z'],
'c' : np.random.normal(0, 1, 5)
}
print (dict_data)
frame_data = pd.DataFrame(dict_data, index=pd.date_range('20160101',periods=5))
print (frame_data)
{'a': [1, 2, 3, 4, 5], 'b': ['L', 'K', 'J', 'M', 'Z'], 'c': array([ 0.59663926, -0.79131971, -0.81207677, -1.61454126, -0.14766187])}
a b c
2016-01-01 1 L 0.596639
2016-01-02 2 K -0.791320
2016-01-03 3 J -0.812077
2016-01-04 4 M -1.614541
2016-01-05 5 Z -0.147662
Series组合成DataFrame的例子:
s_1 = pd.Series([2, 4, 6, 8, 10], name='APPL')
s_2 = pd.Series([1, 3, 5, 7, 9], name="GOOG")
numbers = pd.concat([s_1, s_2], axis=1)
print (numbers)
print (type(numbers))
APPL GOOG
0 2 1
1 4 3
2 6 5
3 8 7
4 10 9
<class 'pandas.core.frame.DataFrame'>
print (numbers.columns)
# 修改列名
numbers.columns = ['MSFT', 'YHOO']
print (numbers)
Index(['APPL', 'GOOG'], dtype='object')
MSFT YHOO
0 2 1
1 4 3
2 6 5
3 8 7
4 10 9
print (numbers.index)
# 修改索引
numbers.index = pd.date_range("20160101",periods=len(numbers))
print (numbers)
# 读取Dataframe的值
print (numbers.values)
RangeIndex(start=0, stop=5, step=1)
MSFT YHOO
2016-01-01 2 1
2016-01-02 4 3
2016-01-03 6 5
2016-01-04 8 7
2016-01-05 10 9
[[ 2 1]
[ 4 3]
[ 6 5]
[ 8 7]
[10 9]]
关键在于一切现在都要考虑多个维度。发生这种情况的主要途径是通过访问DataFrame单独或整组的列,。要做到这一点,
我们可以通过直接访问属性或通过使用我们已经熟悉的方法。
用tushare请求A股数据,然后做成dataframe,最后通过loc与iloc读取需要的数据。
stock1 = pd.read_excel('sz50.xlsx',sheetname='600036.XSHG', index_col='datetime')
stock2 = pd.read_excel('sz50.xlsx',sheetname='600050.XSHG', index_col='datetime')
stock3 = pd.read_excel('sz50.xlsx',sheetname='601318.XSHG', index_col='datetime')
from datetime import datetime
symbol=['600036.XSHG','600050.XSHG','601318.XSHG']
data_dict = {}
for s in symbol:
data = pd.read_excel('sz50.xlsx',sheetname=s, index_col='datetime')
data_dict[s] = data['close']
data = pd.DataFrame(data_dict)
print(data.loc[datetime(2017,1,1):datetime(2017,1,10),['600036.XSHG', '601318.XSHG']])
print(data.iloc[0:2,1])
print(data.iloc[[1, 3, 5] + list(range(7, 20, 2)), [0, 1]].head(20))
600036.XSHG 601318.XSHG
datetime
2017-01-03 15:00:00 69.31 79.89
2017-01-04 15:00:00 69.42 79.87
2017-01-05 15:00:00 69.85 80.02
2017-01-06 15:00:00 69.35 79.38
2017-01-09 15:00:00 69.23 79.38
datetime
2017-01-03 15:00:00 8.99
2017-01-04 15:00:00 8.98
Name: 600050.XSHG, dtype: float64
600036.XSHG 600050.XSHG
datetime
2017-01-04 15:00:00 69.42 8.98
2017-01-06 15:00:00 69.35 9.27
2017-01-10 15:00:00 69.23 9.13
2017-01-12 15:00:00 69.46 8.06
2017-01-16 15:00:00 71.62 7.97
2017-01-18 15:00:00 71.74 8.00
2017-01-20 15:00:00 71.66 7.64
2017-01-24 15:00:00 72.78 7.64
2017-01-26 15:00:00 73.32 7.91
2017-02-06 15:00:00 71.78 7.88
与Series一样,有时候我们想过滤DataFrame根据一组标准。我们通过索引DataFrame布尔值。
print(data.loc[data['600036.XSHG'].pct_change() > data['601318.XSHG'].pct_change()].head())
600036.XSHG 600050.XSHG 601318.XSHG
datetime
2017-01-04 15:00:00 69.42 8.98 79.87
2017-01-05 15:00:00 69.85 9.48 80.02
2017-01-06 15:00:00 69.35 9.27 79.38
2017-01-10 15:00:00 69.23 9.13 78.90
2017-01-11 15:00:00 69.27 8.34 78.90
当你已经有一个DataFrame的数据,这很好,但同样重要的是能够增加你的数据。
添加新数据:
new = pd.read_excel('sz50.xlsx',sheetname='600519.XSHG', index_col='datetime')
data['600519.XSHG'] = new.close
print(data.head(5))
600036.XSHG 600050.XSHG 601318.XSHG 600519.XSHG
datetime
2017-01-03 15:00:00 69.31 8.99 79.89 2078.80
2017-01-04 15:00:00 69.42 8.98 79.87 2186.35
2017-01-05 15:00:00 69.85 9.48 80.02 2155.78
2017-01-06 15:00:00 69.35 9.27 79.38 2180.14
2017-01-09 15:00:00 69.23 9.31 79.38 2165.23
删除某一行
data = data.drop('600050.XSHG', axis=1)
print(data.head(5))
600036.XSHG 601318.XSHG 600519.XSHG
datetime
2017-01-03 15:00:00 69.31 79.89 2078.80
2017-01-04 15:00:00 69.42 79.87 2186.35
2017-01-05 15:00:00 69.85 80.02 2155.78
2017-01-06 15:00:00 69.35 79.38 2180.14
2017-01-09 15:00:00 69.23 79.38 2165.23
合并某一行
gold_stock = pd.read_excel('sz50.xlsx',sheetname='600547.XSHG', index_col='datetime')
df=pd.concat([data,gold_stock['close']], axis=1)
print(df.head(5))
600036.XSHG 601318.XSHG 600519.XSHG close
datetime
2017-01-03 15:00:00 69.31 79.89 2078.80 323.38
2017-01-04 15:00:00 69.42 79.87 2186.35 324.61
2017-01-05 15:00:00 69.85 80.02 2155.78 330.94
2017-01-06 15:00:00 69.35 79.38 2180.14 327.69
2017-01-09 15:00:00 69.23 79.38 2165.23 323.46
df.rename(columns={'close':'600547.XSHG'}, inplace = True)
print(df.head(5))
600036.XSHG 601318.XSHG 600519.XSHG 600547.XSHG
datetime
2017-01-03 15:00:00 69.31 79.89 2078.80 323.38
2017-01-04 15:00:00 69.42 79.87 2186.35 324.61
2017-01-05 15:00:00 69.85 80.02 2155.78 330.94
2017-01-06 15:00:00 69.35 79.38 2180.14 327.69
2017-01-09 15:00:00 69.23 79.38 2165.23 323.46
把一个真实数据输入DataFrame带给我们与在系列中同样的问题,只是这次更多的维度。我们有和系列相同的方法来访问,如下显示。
print(df[df.isnull().values==True])
600036.XSHG 601318.XSHG 600519.XSHG 600547.XSHG
datetime
2017-04-06 15:00:00 73.44 82.76 2435.12 NaN
2017-04-07 15:00:00 72.97 81.97 2441.39 NaN
2017-04-10 15:00:00 73.01 81.61 2417.22 NaN
2017-04-11 15:00:00 73.36 81.03 2394.98 NaN
2017-04-12 15:00:00 73.51 80.63 2424.62 NaN
2017-04-13 15:00:00 73.01 80.43 2476.43 NaN
2017-04-14 15:00:00 72.63 80.38 2456.92 NaN
2017-04-17 15:00:00 72.90 80.67 2459.84 NaN
2017-04-18 15:00:00 71.78 79.96 2511.41 NaN
2017-04-19 15:00:00 71.31 79.51 2521.23 NaN
2017-04-20 15:00:00 70.93 80.18 2580.63 NaN
2017-04-21 15:00:00 72.51 80.79 2492.65 NaN
2017-04-24 15:00:00 73.05 81.01 2478.98 NaN
2017-04-25 15:00:00 73.17 81.66 2603.24 NaN
2017-04-26 15:00:00 73.71 84.01 2597.03 NaN
2017-04-27 15:00:00 73.24 84.71 2621.69 NaN
2017-04-28 15:00:00 73.78 85.00 2565.84 NaN
2017-05-02 15:00:00 73.36 84.93 2559.81 NaN
2017-05-03 15:00:00 73.05 85.11 2594.92 NaN
2017-05-04 15:00:00 72.36 84.68 2585.10 NaN
2017-05-05 15:00:00 72.51 83.41 2586.47 NaN
2017-05-08 15:00:00 72.51 84.59 2516.57 NaN
2017-05-09 15:00:00 72.47 85.11 2541.11 NaN
2017-05-10 15:00:00 73.40 89.36 2547.88 NaN
2017-05-11 15:00:00 74.33 89.07 2560.74 NaN
2017-05-12 15:00:00 78.49 91.72 2568.88 NaN
2017-05-15 15:00:00 79.80 91.65 2606.84 NaN
2017-05-16 15:00:00 79.11 92.05 2673.14 NaN
2017-05-17 15:00:00 77.91 90.53 2650.46 NaN
df_na = df.fillna(method='ffill')
print(df_na.loc['2017-04-26':'2017-05-17'])
600036.XSHG 601318.XSHG 600519.XSHG 600547.XSHG
datetime
2017-04-26 15:00:00 73.71 84.01 2597.03 317.66
2017-04-27 15:00:00 73.24 84.71 2621.69 317.66
2017-04-28 15:00:00 73.78 85.00 2565.84 317.66
2017-05-02 15:00:00 73.36 84.93 2559.81 317.66
2017-05-03 15:00:00 73.05 85.11 2594.92 317.66
2017-05-04 15:00:00 72.36 84.68 2585.10 317.66
2017-05-05 15:00:00 72.51 83.41 2586.47 317.66
2017-05-08 15:00:00 72.51 84.59 2516.57 317.66
2017-05-09 15:00:00 72.47 85.11 2541.11 317.66
2017-05-10 15:00:00 73.40 89.36 2547.88 317.66
2017-05-11 15:00:00 74.33 89.07 2560.74 317.66
2017-05-12 15:00:00 78.49 91.72 2568.88 317.66
2017-05-15 15:00:00 79.80 91.65 2606.84 317.66
2017-05-16 15:00:00 79.11 92.05 2673.14 317.66
2017-05-17 15:00:00 77.91 90.53 2650.46 317.66
使用安装在内部的统计方法来计算DataFrames,我们可以对多个时间序列进行计算。执行计算的代码在DataFrames与在series上几乎一模一样,所以不要担心再学习一切。
将DataFrame数据可视化:
import matplotlib.pyplot as plt
onebegin=data/data.iloc[0]
onebegin.plot()
plt.title("Onebegin Stock Prices")
plt.ylabel("Price")
plt.xlabel("Date")
plt.show()
对dataframe数据进行计算:
print('mean:','\n',data.mean(axis=0))
print('std:','\n',data.std(axis=0))
print(onebegin.head(5))
mean:
600036.XSHG 88.889860
601318.XSHG 104.962372
600519.XSHG 2823.996047
dtype: float64
std:
600036.XSHG 15.039725
601318.XSHG 22.705049
600519.XSHG 532.922368
dtype: float64
600036.XSHG 601318.XSHG 600519.XSHG
datetime
2017-01-03 15:00:00 1.000000 1.000000 1.000000
2017-01-04 15:00:00 1.001587 0.999750 1.051737
2017-01-05 15:00:00 1.007791 1.001627 1.037031
2017-01-06 15:00:00 1.000577 0.993616 1.048749
2017-01-09 15:00:00 0.998846 0.993616 1.041577
将回报率标准化,然后可视化。
mult_returns = data.pct_change()[1:]
print(mult_returns.head(5))
600036.XSHG 601318.XSHG 600519.XSHG
datetime
2017-01-04 15:00:00 0.001587 -0.000250 0.051737
2017-01-05 15:00:00 0.006194 0.001878 -0.013982
2017-01-06 15:00:00 -0.007158 -0.007998 0.011300
2017-01-09 15:00:00 -0.001730 0.000000 -0.006839
2017-01-10 15:00:00 0.000000 -0.006047 0.001894
norm_returns = (mult_returns-mult_returns.mean(axis=0))/mult_returns.std(axis=0)
plt.plot(norm_returns)
plt.hlines(0, norm_returns.index[0],norm_returns.index[-1], linestyles='dashed')
plt.show()
将dataframe里的数据计算40均线,最后可视化展示出来:
rolling_mean = data['601318.XSHG'].rolling(window=40,center=False).mean()
data['601318.XSHG'].plot()
rolling_mean.plot()
plt.title("40days Rolling Mean of 601318.XSHG")
plt.xlabel("Date")
plt.ylabel("Price")
plt.legend()
plt.show()