@Channelchan
2018-09-30T13:30:04.000000Z
字数 6633
阅读 78060
series是处理一维数据,dataframe是处理二维数据。
字典转DataFrames
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
dict_data = {
'a' : [1, 2, 3, 4, 5],
'b' : ['L', 'K', 'J', 'M', 'Z'],
'c' : np.random.randn(5) }
print (dict_data)
frame_data = pd.DataFrame(dict_data, index=pd.date_range('20160101',periods=5))
print (frame_data)
{'a': [1, 2, 3, 4, 5], 'b': ['L', 'K', 'J', 'M', 'Z'], 'c': array([-1.14834319, -0.60200001, 0.17257141, 0.52351166, -0.15281412])}
a b c
2016-01-01 1 L -1.148343
2016-01-02 2 K -0.602000
2016-01-03 3 J 0.172571
2016-01-04 4 M 0.523512
2016-01-05 5 Z -0.152814
dic = {}
dic['name'] = frame_data
s_1 = pd.Series([2, 4, 6, 8, 10], name='APPL')
s_2 = pd.Series([1, 3, 5, 7, 9], name="GOOG")
numbers = pd.concat([s_1, s_2], axis=1)
print (numbers)
print (type(numbers))
APPL GOOG
0 2 1
1 4 3
2 6 5
3 8 7
4 10 9
<class 'pandas.core.frame.DataFrame'>
直接创建
df = pd.DataFrame(np.random.randn(5,2),index=range(0,10,2),columns=list('AB'))
print (numbers.columns)
# 修改列名
numbers.columns = ['MSFT', 'YHOO']
print (numbers)
Index(['APPL', 'GOOG'], dtype='object')
MSFT YHOO
0 2 1
1 4 3
2 6 5
3 8 7
4 10 9
print (numbers.index)
# 修改索引
numbers.index = pd.date_range("20160101",periods=len(numbers))
print (numbers)
# 读取Dataframe的值
print (numbers.values)
RangeIndex(start=0, stop=5, step=1)
MSFT YHOO
2016-01-01 2 1
2016-01-02 4 3
2016-01-03 6 5
2016-01-04 8 7
2016-01-05 10 9
[[ 2 1]
[ 4 3]
[ 6 5]
[ 8 7]
[10 9]]
numbers.iloc[2,1]
5
loc works on labels in the index.
**iloc works on the positions in the index (so it only takes integers). **
df = pd.DataFrame(np.random.randn(5,2),index=range(0,10,2),columns=list('AB'))
print(df)
A B
0 0.778388 -0.589612
2 -0.663451 -0.909238
4 -1.936702 0.266998
6 0.292892 -0.516285
8 0.269110 -0.626136
print(df.iloc[[2]])
A B
4 -1.936702 0.266998
print(df.loc[[2]])
A B
2 -0.663451 -0.909238
data = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]},index=["a","b","c"])
print(data)
A B C
a 1 4 7
b 2 5 8
c 3 6 9
data.loc["b","B"]
5
data.iloc[1,1]
5
print(data.loc['b':'c','B':'C'])
B C
b 5 8
c 6 9
print(data.iloc[1:3,1:3])
B C
b 5 8
c 6 9
print(data.ix[1,1])
print(data.ix["b","B"])
5
5
print(data.ix[1:3,1:3])
print(data.ix['b':'c','B':'C'])
B C
b 5 8
c 6 9
B C
b 5 8
c 6 9
与Series一样,有时候我们想过滤DataFrame根据一组标准。我们通过索引DataFrame布尔值。
print(data.loc[data['600036.XSHG'].pct_change() > data['601318.XSHG'].pct_change()].head())
600036.XSHG 600050.XSHG 601318.XSHG
datetime
2017-01-04 15:00:00 69.42 8.98 79.87
2017-01-05 15:00:00 69.85 9.48 80.02
2017-01-06 15:00:00 69.35 9.27 79.38
2017-01-10 15:00:00 69.23 9.13 78.90
2017-01-11 15:00:00 69.27 8.34 78.90
当你已经有一个DataFrame的数据,这很好,但同样重要的是能够增加你的数据
添加新数据
new = pd.read_excel('sz50.xlsx',sheetname='600519.XSHG', index_col='datetime')
data['600519.XSHG'] = new.close
print(data.head(5))
600036.XSHG 600050.XSHG 601318.XSHG 600519.XSHG
datetime
2017-01-03 15:00:00 69.31 8.99 79.89 2078.80
2017-01-04 15:00:00 69.42 8.98 79.87 2186.35
2017-01-05 15:00:00 69.85 9.48 80.02 2155.78
2017-01-06 15:00:00 69.35 9.27 79.38 2180.14
2017-01-09 15:00:00 69.23 9.31 79.38 2165.23
删除某一列
data = data.drop('600050.XSHG', axis=1)
print(data.head(5))
600036.XSHG 601318.XSHG 600519.XSHG
datetime
2017-01-03 15:00:00 69.31 79.89 2078.80
2017-01-04 15:00:00 69.42 79.87 2186.35
2017-01-05 15:00:00 69.85 80.02 2155.78
2017-01-06 15:00:00 69.35 79.38 2180.14
2017-01-09 15:00:00 69.23 79.38 2165.23
合并某一列
gold_stock = pd.read_excel('sz50.xlsx',sheetname='600547.XSHG', index_col='datetime')
df=pd.concat([data,gold_stock['close']], axis=1)
print(df.head(5))
600036.XSHG 601318.XSHG 600519.XSHG close
datetime
2017-01-03 15:00:00 69.31 79.89 2078.80 323.38
2017-01-04 15:00:00 69.42 79.87 2186.35 324.61
2017-01-05 15:00:00 69.85 80.02 2155.78 330.94
2017-01-06 15:00:00 69.35 79.38 2180.14 327.69
2017-01-09 15:00:00 69.23 79.38 2165.23 323.46
df.rename(columns={'close':'600547.XSHG'}, inplace = True)
print(df.head(5))
600036.XSHG 601318.XSHG 600519.XSHG 600547.XSHG
datetime
2017-01-03 15:00:00 69.31 79.89 2078.80 323.38
2017-01-04 15:00:00 69.42 79.87 2186.35 324.61
2017-01-05 15:00:00 69.85 80.02 2155.78 330.94
2017-01-06 15:00:00 69.35 79.38 2180.14 327.69
2017-01-09 15:00:00 69.23 79.38 2165.23 323.46
把一个真实数据输入DataFrame带给我们与在系列中同样的问题,只是这次更多的维度。我们有和系列相同的方法来访问,如下显示。
print(df[df.isnull().values==True])
600036.XSHG 601318.XSHG 600519.XSHG 600547.XSHG
datetime
2017-04-06 15:00:00 73.44 82.76 2435.12 NaN
2017-04-07 15:00:00 72.97 81.97 2441.39 NaN
2017-04-10 15:00:00 73.01 81.61 2417.22 NaN
2017-04-11 15:00:00 73.36 81.03 2394.98 NaN
2017-04-12 15:00:00 73.51 80.63 2424.62 NaN
2017-04-13 15:00:00 73.01 80.43 2476.43 NaN
2017-04-14 15:00:00 72.63 80.38 2456.92 NaN
2017-04-17 15:00:00 72.90 80.67 2459.84 NaN
2017-04-18 15:00:00 71.78 79.96 2511.41 NaN
2017-04-19 15:00:00 71.31 79.51 2521.23 NaN
2017-04-20 15:00:00 70.93 80.18 2580.63 NaN
2017-04-21 15:00:00 72.51 80.79 2492.65 NaN
2017-04-24 15:00:00 73.05 81.01 2478.98 NaN
2017-04-25 15:00:00 73.17 81.66 2603.24 NaN
2017-04-26 15:00:00 73.71 84.01 2597.03 NaN
2017-04-27 15:00:00 73.24 84.71 2621.69 NaN
2017-04-28 15:00:00 73.78 85.00 2565.84 NaN
2017-05-02 15:00:00 73.36 84.93 2559.81 NaN
2017-05-03 15:00:00 73.05 85.11 2594.92 NaN
2017-05-04 15:00:00 72.36 84.68 2585.10 NaN
2017-05-05 15:00:00 72.51 83.41 2586.47 NaN
2017-05-08 15:00:00 72.51 84.59 2516.57 NaN
2017-05-09 15:00:00 72.47 85.11 2541.11 NaN
2017-05-10 15:00:00 73.40 89.36 2547.88 NaN
2017-05-11 15:00:00 74.33 89.07 2560.74 NaN
2017-05-12 15:00:00 78.49 91.72 2568.88 NaN
2017-05-15 15:00:00 79.80 91.65 2606.84 NaN
2017-05-16 15:00:00 79.11 92.05 2673.14 NaN
2017-05-17 15:00:00 77.91 90.53 2650.46 NaN
df_na = df.fillna(method='ffill')
print(df_na.loc['2017-04-26':'2017-05-17'])
600036.XSHG 601318.XSHG 600519.XSHG 600547.XSHG
datetime
2017-04-26 15:00:00 73.71 84.01 2597.03 317.66
2017-04-27 15:00:00 73.24 84.71 2621.69 317.66
2017-04-28 15:00:00 73.78 85.00 2565.84 317.66
2017-05-02 15:00:00 73.36 84.93 2559.81 317.66
2017-05-03 15:00:00 73.05 85.11 2594.92 317.66
2017-05-04 15:00:00 72.36 84.68 2585.10 317.66
2017-05-05 15:00:00 72.51 83.41 2586.47 317.66
2017-05-08 15:00:00 72.51 84.59 2516.57 317.66
2017-05-09 15:00:00 72.47 85.11 2541.11 317.66
2017-05-10 15:00:00 73.40 89.36 2547.88 317.66
2017-05-11 15:00:00 74.33 89.07 2560.74 317.66
2017-05-12 15:00:00 78.49 91.72 2568.88 317.66
2017-05-15 15:00:00 79.80 91.65 2606.84 317.66
2017-05-16 15:00:00 79.11 92.05 2673.14 317.66
2017-05-17 15:00:00 77.91 90.53 2650.46 317.66
print('mean:','\n',data.mean(axis=0))
print('std:','\n',data.std(axis=0))
print(data.describe())
print(onebegin.head(5))
mean:
600036.XSHG 88.889860
601318.XSHG 104.962372
600519.XSHG 2823.996047
dtype: float64
std:
600036.XSHG 15.039725
601318.XSHG 22.705049
600519.XSHG 532.922368
dtype: float64
600036.XSHG 601318.XSHG 600519.XSHG
count 215.000000 215.000000 215.000000
mean 88.889860 104.962372 2823.996047
std 15.039725 22.705049 532.922368
min 69.230000 78.820000 2078.800000
25% 73.380000 81.390000 2397.590000
50% 87.370000 107.730000 2834.120000
75% 103.450000 120.980000 3053.000000
max 121.820000 169.570000 4467.820000
600036.XSHG 601318.XSHG 600519.XSHG
datetime
2017-01-03 15:00:00 1.000000 1.000000 1.000000
2017-01-04 15:00:00 1.001587 0.999750 1.051737
2017-01-05 15:00:00 1.007791 1.001627 1.037031
2017-01-06 15:00:00 1.000577 0.993616 1.048749
2017-01-09 15:00:00 0.998846 0.993616 1.041577