@Macux 2015-12-08T05:49:33.000000Z

Python Data Structures

Python

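1. NumPy

A powerful tool for processing large data sets in Python.

Mastering array-oriented programming and thinking is a key step toward becoming a proficient Python data analyst.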

import numpy as np

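1.1 ndarray basics

An ndarray is essentially a general-purpose multidimensional container for homogeneous data: all of its elements must be of the same type.

Every array has a 'shape' and a 'dtype' attribute.

A first look

# Generate 10 random numbers from the standard normal distribution, arranged as 5 rows x 2 columns
data = np.random.randn(10).reshape(5,2)
# Check the array's shape
data.shape
(5, 2)
# Check the array's data type
data.dtype
dtype('float64')  # default type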

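Arrays and lists

(1) Arrays are "derived from" lists, yet go "beyond" them.

arr1 = [(1,2),(3,4)]    # a list
arr2 = np.array([[1,2],[3,4]]) # an array; the innermost square brackets may be replaced by parentheses

(2) Creating an ndarray from an existing list is the best approach!

arr1 = [(1,2),(3,4)] 
arr2 = np.array(arr1)

(3) After creation, change the array's two key attributes (shape & dtype)

ry = np.zeros(10) 
ry = ry.reshape(5,2)
ry = ry.astype('int')
ry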
array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])

(4) Modifying array values

# Direct assignment
ry[1] = [1,2]
ry
array([[0, 0],
       [1, 2],
       [0, 0],
       [0, 0],
       [0, 0]])

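# Modifying through a slice
an = np.arange(10)
an = an.astype('float') # arrays created with np.arange() are int by default
an = an.reshape(5,2)
anci = an[1:4]  # create a slice (a view on an); a slice must contain ':'
anci[2] = np.random.randn(1,2)
# How to modify a single value through a slice
arr = np.arange(10)
arr2 = arr[5:6]   # again, the ':' is what makes it a slice
arr2[0] = 202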
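
The slice assignments above work because a basic slice is a view on the original array's data rather than a copy. A minimal sketch of the difference (the names `base`, `view` and `snapshot` are illustrative and not part of the original notes):

```python
import numpy as np

base = np.arange(10)

view = base[5:6]             # basic slicing returns a view on base's data
view[0] = 202                # writing through the view changes base

snapshot = base[5:6].copy()  # an explicit copy is independent of base
snapshot[0] = -1             # base is left untouched

print(base[5])                           # 202
print(np.shares_memory(base, view))      # True
print(np.shares_memory(base, snapshot))  # False
```

Calling `.copy()` is the usual choice when the slice should be modified without touching the source array.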

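1.2 ndarray indexing

Boolean indexing

Selecting data from an array with a boolean index always creates a copy of the data, even when the result is identical to the original.

names = np.array(['ryan','fan','peng','tera','data'])
value = np.random.randn(5,6)
msk = (names != 'peng') | (names == 'tera')   # a boolean array; '&' can be used for 'and'
value[msk]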

Fancy indexing

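Indexing with integer arrays. (Once it is broken into pieces, everything becomes clear at a glance!)

# The form of fancy indexing I prefer:
pyt = np.arange(32).reshape(8,4)
pyt[np.ix_([1,5,7,2],[1,2])]
# Select the 2nd, 6th, 8th and 3rd rows of pyt, then the 2nd and 3rd columns of that subset.

# Combined with assignment (the floats are cast to pyt's integer dtype)
pyt[np.ix_([1,5,7,2],[0,3,2,1])]  = np.random.randn(4,4)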
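
To see what `np.ix_` adds, it may help to compare it with passing two plain index lists. The sketch below reuses the `pyt` array from above; the names `paired` and `block` are illustrative:

```python
import numpy as np

pyt = np.arange(32).reshape(8, 4)

# Two index lists are paired element by element: positions (1, 1) and (5, 2).
paired = pyt[[1, 5], [1, 2]]

# np.ix_ builds an open mesh: every listed row crossed with every listed column.
block = pyt[np.ix_([1, 5], [1, 2])]

print(paired)   # [ 5 22]
print(block)    # [[ 5  6]
                #  [21 22]]
```

The paired form returns a 1-D array of individual elements, while the `np.ix_` form returns the rectangular sub-block.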

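1.3 NumPy and data processing

NumPy's strength in data processing is that concise array expressions can replace explicit loops.

Such vectorized array operations are typically one to two orders of magnitude (10 to 100 times) faster than the equivalent pure-Python code.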
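
As a rough illustration of replacing a loop with an array expression, the sketch below computes a sum of squares both ways. The function names are hypothetical, and no timings are claimed here; the actual speed-up depends on the machine and the array size and can be measured with `timeit`:

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)

def sum_of_squares_loop(xs):
    """Pure-Python loop: one interpreter step per element."""
    total = 0.0
    for x in xs:
        total += x * x
    return total

def sum_of_squares_vectorized(xs):
    """Single array expression: the loop runs inside NumPy's compiled code."""
    return float(np.sum(xs * xs))

# Both versions give the same answer; only the execution model differs.
print(sum_of_squares_loop(values) == sum_of_squares_vectorized(values))
```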

np.where

From the official documentation:
When True, yield 'x', otherwise, yield 'y'.

mar = np.random.randn(5,5)
np.where(mar > 0,2,mar)  # set every positive value to 2

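Array statistics

(1) First, an explanation of the dizzying 'axis' parameter:

For two-dimensional data, 'axis' is usually either '0' or '1'. (Occasionally '-1' or '-2', which count from the last axis.)

'axis = 0': the operation runs down the rows, giving one result per column. (This is the default in pandas; NumPy reductions default to all elements.)

'axis = 1': the operation runs across the columns, giving one result per row.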
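
One way to keep 'axis' straight is to remember that it names the axis that gets collapsed. A small sketch (the array `grid` is made up for illustration):

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # [[0 1 2]
                                    #  [3 4 5]]

col_sums = grid.sum(axis=0)   # collapse the rows: one sum per column
row_sums = grid.sum(axis=1)   # collapse the columns: one sum per row
total = grid.sum()            # no axis: reduce over every element

print(col_sums)   # [3 5 7]
print(row_sums)   # [ 3 12]
print(total)      # 15
```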

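(2) A quick try

mill = np.random.rand(5,4)
# An 'axis' argument can be added to compute the statistic along one axis; if omitted, all elements are used.
mill.mean()     # arithmetic mean
mill.sum()      # sum of the elements
mill.std()      # standard deviation
mill.var()      # variance
mill.min()      # minimum
mill.max()      # maximum
mill.argmin()   # index of the minimum
mill.argmax()   # index of the maximum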

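Sorting

(1) Sorting that modifies the original array in place

vinda = np.random.rand(10)
vinda.sort()  # returns nothing; the original array itself is sorted

(2) Sorting that produces a copy (recommended)

vinda2 = np.random.rand(5,2)        # a 2-D array, so that both axes exist
bind1 = np.sort(vinda2,axis = -1)   # sorts within each row; same as 'axis = 1' for a 2-D array. Default
bind2 = np.sort(vinda2,axis = -2)   # sorts within each column; same as 'axis = 0' for a 2-D array

(3) Both kinds of sort are ascending by default and offer no descending option. What to do?

vinda = np.random.rand(10)
vinda.sort()  # returns nothing; the original array itself is sorted
vinda[::-1]   # reverse the sorted array to get descending order; for a 2-D array use vinda[:,::-1]

Random number generation (R is more powerful and more convenient here!)

np.random.seed(10)  
np.random.shuffle(vinda)                       # shuffle a sequence in place; the original sequence is changed
np.random.rand(10,3)                           # uniformly distributed random numbers on [0, 1)
np.random.randint(1,6,size = (10,20))          # random integers in [1, 6) in an array of any shape; weights cannot be set
np.random.randn(10,3)                          # standard normal random numbers
np.random.normal(loc = k,scale = t,size = ...) # general normal random numbers with mean k and standard deviation t
...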

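2. pandas

pandas provides high-level data structures and manipulation tools that make data analysis faster and easier.

pandas is built on top of NumPy and makes NumPy-centric applications simpler.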

import pandas as pd
from numpy import nan as NA

2.1 Series

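A Series is a one-dimensional array-like object.

It consists of a set of data (of any NumPy data type) and an associated set of data labels (the index).

It is similar to a vector in R.

Creating a Series

(1) The most common way

obj = pd.Series([3,4,5,6,NA],index = ['a','b','v','r','p'])

(2) Converting from a dict

sta = {'a':3,'b':4,'v':5,'r':6}
obj = pd.Series(sta)
# The dict keys automatically become the index.

(3) Modifying a Series

# Modify a value
obj['p'] = 77  # the NA value in obj is replaced by 77 (if 'p' does not exist yet, it is added)
# Replace the index
obj.index = ['e','d','c','b','a']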

Listing the distinct values with unique()

Return array of unique values in the Series. Significantly faster than numpy.unique.

kobe = np.arange(0,5)
bryant = np.array(['q','w','w','r','t'])
lakers = pd.DataFrame({'ryan' : kobe,'fan' : bryant}
                      ,index = ['ww','we','wq','wr','wt']
                      ,columns= ['ryan','fan'])
lakers.iloc[:,1].unique()   # .iloc selects by position (.ix was removed in pandas 1.0)
array(['q', 'w', 'r', 't'], dtype=object)

Detecting missing data

pd.isnull(obj)
a    False
b    False
v    False
r    False
p     True
dtype: bool
# The result is a Series
pd.notnull(obj)
a     True
b     True
v     True
r     True
p    False
dtype: bool

2.2 DataFrame

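A DataFrame is a tabular data structure.

It can be viewed as a dict of Series that share the same row index.

Essentially, a DataFrame consists only of values and indexes; there is no such thing as a "field" or a "variable".

For ease of understanding, though, the column index of a DataFrame can be thought of as the fields of a database table.

Creating a DataFrame

(1) The way I find clearest

# Create the arrays first (like creating vectors first in R)
kobe = np.arange(0,5)
bryant = np.array(['q','w','e','r','t'])
# Then call pd.DataFrame()
lakers = pd.DataFrame({'ryan' : kobe,'fan' : bryant}
                      ,index = ['ww','we','wq','wr','wt']
                      ,columns= ['ryan','fan'])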
    ryan fan
ww     0   q
we     1   w
wq     2   e
wr     3   r
wt     4   t
[5 rows x 2 columns]

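(2) Selecting from a DataFrame

# A safe way to select a particular column
lakers['ryan']
ww    0
we    1
wq    2
wr    3
wt    4
Name: ryan, dtype: int32
# Selecting rows and columns by label (.loc replaces .ix, which was removed in pandas 1.0)
lakers.loc[['ww','we'],['ryan']]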
    ryan
ww     0
we     1
[2 rows x 1 columns]

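(3) Advanced selection

How can a data set be selected conditionally? For example, I only want to see the rows whose 'brand' field is 1.

This is easy in a database; in Python it is less obvious at first.

Fortunately there is Wes McKinney, the creator of pandas!

Still, it is not as handy as subset() in R!

# Read the data
data = pd.read_csv('zaiwang_train.csv')  
# Group with 'groupby', then turn the groups into a dict
data_dic = dict(list(data.groupby('brand'))) 
# The values of brand are ints, so no quotes are needed!
data_dic[2]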
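
A boolean mask offers another route to the same kind of conditional selection. The sketch below uses a small made-up frame (`demo` and its columns are hypothetical stand-ins for the CSV data above) and compares the mask with the groupby-to-dict approach:

```python
import pandas as pd

demo = pd.DataFrame({'brand': [1, 2, 1, 3, 2],
                     'fee':   [10.0, 25.5, 8.0, 40.0, 12.5]})

# Approach from the text: split once, then look groups up by key.
by_brand = dict(list(demo.groupby('brand')))
brand_2_from_dict = by_brand[2]

# Alternative: a boolean mask selects the matching rows directly.
brand_2_from_mask = demo[demo['brand'] == 2]

print(brand_2_from_mask)
```

The dict is convenient when many groups will be looked up repeatedly; the mask is shorter for a one-off filter.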

(4) Modifying values in a DataFrame

df = pd.DataFrame({'key1' : ['a','a','b','b','a'],
                   'key2' : ['one','two','one','two','one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.loc[[0,1],['data1','data2']] = [100,99]

(5) Adding new rows and columns to a DataFrame elegantly

peo = pd.DataFrame(np.random.randn(5,5)
                   ,columns=['aa','b','ccc','d','ee']
                   ,index=['Joe','Steve','Wes','Jim','Travis'])
# Add new columns elegantly
peo['ff'] = peo['aa'] + peo['b']
peo['gg'] = np.random.randn(5)
# Add new rows elegantly (after the additions above, peo now has 7 columns)
peo.loc['ryan',:] = np.arange(7)
peo.loc['fan',:] = peo.loc['Joe',:] + peo.loc['Wes',:]

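2.3 Indexes

The Index objects of a Series or DataFrame are immutable: individual labels cannot be modified in place, although the index can be reordered or replaced as a whole.

They are similar to the keys of a dict.

Reindexing

lakers2 = lakers.reindex(['wt','ww','wr','we','wq','ya'],fill_value = 'wen')
    ryan fan
wt     4   t
ww     0   q
wr     3   r
we     1   w
wq     2   e
ya   wen wen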
[6 rows x 2 columns]

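Deleting from a Series or DataFrame by index label

obj.drop('v')
lakers.drop(['wr','we'])           # drop rows
lakers.drop(['ryan'],axis = 1)     # drop a column

Flexible arithmetic

pandas has a nice feature: arithmetic operations automatically align data that have different indexes.

(1) Between objects of the same kind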

ope = pd.Series([7,8,9,10],index = ['a','r','v','q'])
opw = pd.Series([17,18,19,11],index = ['a','r','l','q'])
ope + opw
a    24
l   NaN
q    21
r    26
v   NaN
dtype: float64
# Producing NA values like this is undesirable!
ope.add(opw,fill_value = 0)   # a label missing from one side is treated as 0
a    24
l    19
q    21
r    26
v     9
dtype: float64

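The same works for DataFrames, so no example is given. Other flexible arithmetic methods:
(1) sub: subtraction
(2) div: division
(3) mul: multiplication

(2) Between a DataFrame and a Series

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.

bla = pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),
                   index = ['ryan','fan','peng','jbl'])
whi = pd.Series([1,2,3,4],index= ['ryan','fan','peng','jbl'])
blu = pd.Series([2,3,4],index = list('bde'))
ultra = bla.sub(whi,axis = 0)    # override the default axis: match on the row index instead
fomat = lambda x :  "%.2f" % x   # limits the number of decimal places
musi = ultra.applymap(fomat)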
          b     d     e
ryan  -1.00  0.00  1.00
fan    1.00  2.00  3.00
peng   3.00  4.00  5.00
jbl    5.00  6.00  7.00
[4 rows x 3 columns]
bla.mul(blu)
       b   d   e
ryan   0   3   8
fan    6  12  20
peng  12  21  32
jbl   18  30  44
[4 rows x 3 columns]

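Sorting

(1) Sorting by index

# By default, sorts by the row index in ascending order.
obj.sort_index(axis = 0,ascending = True)  # obj can be either a Series or a DataFrame.

(2) Sorting by value

# Sort a Series by value (sort_values replaces the older order())
Ser.sort_values(na_position = 'last', ascending = True)  # by default NA values go last and the order is ascending
# Sort a DataFrame by value (sort_values replaces the older sort_index(by=...))
DF.sort_values(by = 'k')  # 'k' is Column name(s) in DataFrame.

Ranking

fram = pd.DataFrame({'b':[4,7,-3,2],'a':[-2,4,5,10],'c':[0.1,2.3,-1.9,22]}
                     ,index=['d','a','r','y'])
fram.rank(ascending = False)  # I prefer descending rank: the largest value gets rank 1, which matches everyday usage.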
   a  b  c
d  4  2  3
a  3  1  2
r  2  4  4
y  1  3  1
[4 rows x 3 columns]
fram.rank(ascending = False,axis =1)  
   a  b  c
d  3  1  2
a  2  1  3
r  1  3  2
y  2  3  1
[4 rows x 3 columns]

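Duplicate index labels

pandas allows an index to contain duplicate values.

dup = pd.Series(np.arange(3),index = ['a','a','b'])   # named 'dup' so that the built-in len() is not shadowed
dup.index.is_unique
False

Descriptive and summary statistics

One caveat: it does not allow duplicate index labels.

# This one function is all you need to remember
obj.describe()     # summary statistics for a Series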
count    4.000000
mean     4.500000
std      1.290994
min      3.000000
25%      3.750000
50%      4.500000
75%      5.250000
max      6.000000
dtype: float64
fram.describe()   # summary statistics for a DataFrame
         a         b          c
count   4.000000  4.000000   4.000000
mean    4.250000  2.500000   5.625000
std     4.924429  4.203173  11.050603
min    -2.000000 -3.000000  -1.900000
25%     2.500000  0.750000  -0.400000
50%     4.500000  3.000000   1.200000
75%     6.250000  4.750000   7.225000
max    10.000000  7.000000  22.000000
[8 rows x 3 columns]
## Q: When there is a grouping variable, how do you run describe() per group?
## A: Use a "GroupBy" object
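
A minimal sketch of the answer above, using made-up data (the frame `scores` and its column names are illustrative). In recent pandas versions the result is a table with one row per group and one column per statistic:

```python
import pandas as pd

scores = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                       'value': [1.0, 3.0, 2.0, 4.0, 6.0]})

# describe() on a GroupBy object yields one block of statistics per group.
summary = scores.groupby('group')['value'].describe()

print(summary)
```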

isin(), value_counts(), count()

(1) isin()

msk_zai = zaiwang.isin([0])

(2) The difference between value_counts() and count()

vin = pd.Series([1,3,4,3],index = ['a','f','e','w'])

value_counts(values, sort=True, ascending=False, normalize=False, bins=None)
Compute a histogram of the counts of non-null values
Parameters
----------
values :    ndarray (1-d)
sort :      boolean, default True
            Sort by values
ascending : boolean, default False
            Sort in ascending order
normalize:  boolean, default False
            If True then compute a relative histogram
bins :      integer, optional
            Rather than count values, group them into half-open bins,convenience for pd.cut, only works with numeric data
Returns
-------
value_counts : Series
vin.value_counts()
3    2
1    1
4    1
dtype: int64
(lakers.iloc[:,0] > 3).value_counts()
False    4
True     1
dtype: int64

count(self, axis=0, level=None) 
Return Series with number of non-NA/null observations over requested axis.
Parameters
----------
axis :         {0, 1}
               0 for row-wise, 1 for column-wise
level :        int, default None
               If the axis is a MultiIndex (hierarchical), count along a
               particular level, collapsing into a DataFrame
Returns
-------
count : Series (or DataFrame if level specified)
vin.count()
4
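
The difference shows most clearly when a missing value is present. A small sketch (the Series `marks` is made up for illustration):

```python
import numpy as np
import pandas as pd

marks = pd.Series([1, 3, 4, 3, np.nan], index=list('afewz'))

n_valid = marks.count()              # a single number: how many non-NA observations
frequencies = marks.value_counts()   # a Series: how often each distinct non-NA value occurs

print(n_valid)                # 4
print(frequencies.loc[3.0])   # 2
```

`count()` answers "how many values are there", while `value_counts()` answers "how are they distributed"; both skip the NaN here.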

Handling missing values

(1) Dropping missing values

dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False) 
Parameters
----------
axis :   {0, 1}, or tuple/list thereof
         Pass tuple or list to drop on multiple axes
how :    {'any', 'all'}
         * any : if any NA values are present, drop that label
         * all : if all values are NA, drop that label
thresh : int, default None
         int value : require that many non-NA values
subset : array-like
         Labels along other axis to consider, 
         e.g. if you are dropping rows these would be a list of columns to include
inplace: boolean, default False
         If True, do operation inplace and return None.

(2) Filtering missing values

data[data.notnull()]
data[data.isnull()]

(3) Filling missing values

fillna(self, value=None, method=None, axis=0, inplace=False, limit=None)
Parameters
----------
method :{'bfill', 'ffill'}, 
        ffill: propagate last valid observation forward to next valid.
        bfill: use NEXT valid observation to fill gap.
value : scalar, dict, or Series
        Value to use to fill holes (e.g. 0), alternately a dict/Series of
        values specifying which value to use for each index (for a Series) or
        column (for a DataFrame). (values not in the dict/Series will not be
        filled). This value cannot be a list.
axis :  {0, 1}, default 0
        * 0: fill column-by-column
        * 1: fill row-by-row
inplace : boolean, default False
        If True, fill in place. Note: this will modify any
        other views on this object, (e.g. a no-copy slice for a column in a
        DataFrame).
limit : int, default None
        Maximum size gap to forward or backward fill
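
The parameters above can be tried on a tiny frame. This is a sketch with made-up data (`gaps` is hypothetical); it uses the `ffill()` method for forward filling, which recent pandas versions recommend over `fillna(method='ffill')`:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, np.nan, 6.0]})

complete_rows = gaps.dropna()                       # keep only rows with no NA at all
not_all_missing = gaps.dropna(how='all')            # drop rows where every value is NA
filled_const = gaps.fillna({'a': 0.0, 'b': -1.0})   # a different fill value per column
filled_forward = gaps.ffill()                       # carry the last valid value forward

print(filled_forward)
```

Note that forward filling leaves the leading NaNs in column 'b' untouched, because there is no earlier valid value to propagate.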

3. Reading and writing data

3.1 Reading data

# It is convenient to keep the data file in the same directory as the script.
train = pd.read_csv('train_fs.csv')

3.2 Writing data

train.to_csv('train_fs_w.csv')
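
By default `to_csv` also writes the row index as an extra first column, which then comes back as an unnamed column on re-reading. A round-trip sketch using an in-memory buffer instead of a file (the frame contents are made up for illustration):

```python
import io
import pandas as pd

train = pd.DataFrame({'id': [1, 2], 'score': [0.5, 0.75]})

buffer = io.StringIO()
train.to_csv(buffer, index=False)   # index=False: do not write the row index as a column
buffer.seek(0)                      # rewind so the buffer can be read from the start

restored = pd.read_csv(buffer)

print(restored.equals(train))   # True
```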

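4. Data wrangling

4.1 Combining data sets

pandas offers three ways to combine DataFrames: 'merge', 'join' and 'concat'.

'merge' and 'join' combine data in database style, which I find of limited use.
(Python is not as specialized as a database for this kind of work, so most people would simply use the database.)

'concat' is used more often; it closely resembles 'rbind' and 'cbind' in R.

Simple concatenation

pd.concat([dat1,dat2],axis = 0)  # default; like 'rbind()'
pd.concat([dat1,dat2],axis = 1)  # like 'cbind()'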
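
A self-contained sketch of both directions, with two small made-up frames standing in for `dat1` and `dat2`:

```python
import pandas as pd

dat1 = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
dat2 = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})

stacked = pd.concat([dat1, dat2], axis=0, ignore_index=True)  # rows appended, index renumbered
side_by_side = pd.concat([dat1, dat2], axis=1)                # columns appended, rows aligned on the index

print(stacked.shape)        # (4, 2)
print(side_by_side.shape)   # (2, 4)
```

Without `ignore_index=True` the stacked result would keep the original labels 0, 1, 0, 1, which is one way duplicate index labels arise.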

Combining with fill

fra1= pd.DataFrame({'b':[4,4,-3,NA],'a':[-2,4,NA,10],
                    'c':[NA,2.3,-1.9,22]})
fra2 = pd.DataFrame({'a':np.random.randn(4),'b':np.random.chisquare(4,size=4),
                     'c':np.random.normal(loc=3,scale=9,size=4)})
fra1.combine_first(fra2) 
           a         b          c
0  -2.000000  4.000000  -0.038691
1   4.000000  4.000000   2.300000
2   1.308473 -3.000000  -1.900000
3  10.000000  4.477128  22.000000
[4 rows x 3 columns]

COALESCE() in Python

fra3 = np.where(pd.isnull(fra1),fra2,fra1)
array([[ -2.        ,   4.        ,  -0.03869103],
       [  4.        ,   4.        ,   2.3       ],
       [  1.30847308,  -3.        ,  -1.9       ],
       [ 10.        ,   4.47712772,  22.        ]])
# Such an ugly ndarray is simply unacceptable! @_@
fra4 = pd.DataFrame(fra3,columns=['a','b','c'])
fomat =  lambda x :  "%.2f" % x  
fra4.applymap(fomat)
       a      b      c
0  -2.00   4.00  -0.04
1   4.00   4.00   2.30
2   1.31  -3.00  -1.90
3  10.00   4.48  22.00
[4 rows x 3 columns]

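4.2 Data transformation

Reshaping and pivoting

The most common case: when the row index is hierarchical, use unstack() to make the result easier to read!

# A GroupBy example
result = tip.groupby('smoker')['tip_pct'].describe()   # the row index of the result is hierarchical and hard to read
# Use unstack() to tidy it up
result.unstack('smoker')

Removing duplicates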

drop_duplicates(self, cols=None, take_last=False, inplace=False) 
Parameters
----------
cols:       column label or sequence of labels, optional
            Only consider certain columns for identifying duplicates, 
            by default use all of the columns.
take_last:  boolean, default False
            Take the last observed row in a row. Defaults to the first row
inplace:    boolean, default False
            Whether to drop duplicates in place or to return a copy
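
A short usage sketch. Note that current pandas releases name these parameters `subset` and `keep` rather than the `cols` and `take_last` shown in the older signature above; the frame `dups` is made up for illustration:

```python
import pandas as pd

dups = pd.DataFrame({'k1': ['one', 'one', 'two', 'two'],
                     'k2': [1, 1, 2, 3]})

unique_rows = dups.drop_duplicates()                           # compare all columns
first_per_k1 = dups.drop_duplicates(subset='k1')               # keep the first row for each k1
last_per_k1 = dups.drop_duplicates(subset='k1', keep='last')   # keep the last row for each k1

print(unique_rows.index.tolist())    # [0, 2, 3]
print(first_per_k1.index.tolist())   # [0, 2]
print(last_per_k1.index.tolist())    # [1, 3]
```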

Replacing values

vb1 = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,
                   'k2':[1,2,2,2,1,1,4]})
vb2 = vb1.replace(['one',2],['ryan',3])
     k1  k2
0  ryan   1
1  ryan   3
2  ryan   3
3   two   3
4   two   1
5   two   1
6   two   4
[7 rows x 2 columns]

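Discretization and binning

To make analysis easier, continuous data is sometimes discretized into bins.

Two functions are worth knowing:

pd.cut(): splits the range between the minimum and maximum into equal-width bins (every interval has the same length).

pd.qcut(): splits by sample quantiles into equal-sized bins (every interval holds the same number of observations).

The most common use is to split the data into k groups.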

np.random.seed(100)
data1 = np.random.rand(100)
vin = pd.cut(data1,5)
pd.value_counts(vin)
(0.202, 0.4]        28
(0.795, 0.992]      21
(0.00373, 0.202]    21
(0.597, 0.795]      16
(0.4, 0.597]        14
dtype: int64

vinq = pd.qcut(data1,5)
pd.value_counts(vinq)
(0.34, 0.576]       20
[0.00472, 0.197]    20
(0.576, 0.798]      20
(0.197, 0.34]       20
(0.798, 0.992]      20
dtype: int64

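Indicator matrices

(1) Converting a categorical variable into an indicator (dummy) matrix is very important in data preprocessing.

brand_new = pd.get_dummies(zaiwang['brand'])  
brand_new = brand_new.rename(columns = {1:'全球通',2:'神州行',3:'动感地带'})  # rename the indicator columns to the brand names
# First delete the original brand column
del zaiwang['brand']   # much like a SQL statement; the deletion is permanent
# Add the new 'brand' information, now split into three columns
zaiwang = pd.concat([zaiwang,brand_new],axis = 1) # concatenate column-wise

(2) Discretize a variable first, then convert it into an indicator matrix

First use a suitable discretization method to find appropriate bins;

then generate the sparse indicator matrix.

gh = np.random.randn(1000)
bins = [-2.855,-0.685,-0.00124,0.639, 2.804]
pd.get_dummies(pd.cut(gh,bins)).head()   # show the first 5 rows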
          (-0.00124, 0.639]  (-0.685, -0.00124]  (-2.855, -0.685]  (0.639, 2.804]
0                  0                   0                 0               1
1                  0                   1                 0               0
2                  0                   0                 1               0
3                  0                   1                 0               0
4                  0                   0                 1               0
[5 rows x 4 columns]
pd.get_dummies(pd.cut(gh,bins)).tail()   # show the last 5 rows
             (-0.00124, 0.639]  (-0.685, -0.00124]  (-2.855, -0.685]  (0.639, 2.804]
995                  0                   0                 0               1
996                  1                   0                 0               0
997                  1                   0                 0               0
998                  0                   0                 1               0
999                  1                   0                 0               0
[5 rows x 4 columns]

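5. Data aggregation and group operations

5.1 GroupBy mechanics

Group keys can take many forms:
(1) a list or array whose length equals that of the axis being grouped;
(2) a column of the DataFrame;
(3) a dict or Series giving the correspondence between the values on the grouped axis and the group names;
(4) a function applied to the axis index or to the individual labels in the index.

The usual way (using a DataFrame column as the group key)

np.random.seed(88)
df = pd.DataFrame({'key1' : ['a','a','b','b','a'],
                   'key2' : ['one','two','one','two','one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
grouped_A = df['data1'].groupby(df['key1'])   # a SeriesGroupBy object

The way I prefer

# Using syntactic sugar
grouped_A = df.groupby(df['key1'])['data1']     # SeriesGroupBy: aggregates to a Series
grouped_B = df.groupby(df['key1'])[['data1']]   # DataFrameGroupBy: aggregates to a DataFrame
# Two things to note:
## 1. When more than one column is selected, the result is always a DataFrame;
## 2. When ['data1'] or [['data1']] is omitted, all columns are grouped by the current key, and other categorical columns are dropped automatically (pandas 2.0 and later need numeric_only=True for this).

At this point "grouped_A" and "grouped_B" are still just "GroupBy" objects; nothing has been computed yet.

This corresponds to having just finished the "split" step.

Let's see how powerful "GroupBy" really is:

(0) The laziest and most common usage:

group_by.describe().unstack()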

(1) Starting with the basics:

reilly = df[['data1']].groupby([df['key1'],df['key2']]).mean()
              data1
key1 key2          
a    one   0.587699
     two   2.205815
b    one   0.956563
     two   0.068411
[4 rows x 1 columns]
reilly.unstack()    # unstack() is the cure for hierarchical indexes!
              data1          
key2       one       two
key1                    
a     0.587699  2.205815
b     0.956563  0.068411
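# The inner index level becomes the columns; the outer level remains the row index.

(2) What if the original data set has no categorical key? (Use arrays as group keys to group the records)

# Build the categorical variables from arrays
states = np.array(['Ohio','California','California','Ohio','Ohio'])
years = np.array([2014,2014,2015,2014,2015])
# The arrays line up with the rows of the data set in order, as if they were two new columns.
# Creating group keys yourself like this makes GroupBy much more flexible!
df.groupby([states,years])['data1'].mean()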
California  2014    2.205815
            2015    0.956563
Ohio        2014    0.087648
            2015    1.068514
Name: data1, dtype: float64
# The unstack is purely cosmetic!
df.groupby([states,years])['data1'].mean().unstack()
                2014      2015
California  2.205815  0.956563
Ohio        0.087648  1.068514
[2 rows x 2 columns]

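(3) Counting frequencies for compound groups (a strength of Python)

# As mentioned earlier, pd.value_counts() gives the distribution of the values in one column,
# but it cannot count frequencies for compound groups.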
tip.groupby(['sex','smoker']).size().unstack()
smoker  No  Yes
sex            
Female  54   33
Male    97   60

(4) Filtering a DataFrame by the value of a categorical variable (a weak spot of Python)

# Turn the "GroupBy" object into a dict
pie = dict(list(df.groupby('key1')))
pie['b']
      data1     data2 key1 key2
2  0.956563  0.730430    b  one
3  0.068411 -0.171214    b  two
[2 rows x 4 columns]

In R this is even more concise:

subset(df,subset = (key1 == 'b'))

(5) Grouping by a Series or a dict (grouping the columns)

# Group by a dict
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
by_columns = peo.groupby(mapping,axis=1)
by_columns.sum()
            blue       red
Joe    -1.554133 -0.040501
Steve  -0.277024 -0.747329
Wes     0.552219       NaN
Jim    -1.522006  0.885805
Travis  0.101749  0.722256
[5 rows x 2 columns]

# Group by a Series
map_series = pd.Series(mapping)
peo.groupby(map_series,axis=1).count()
         blue  red
Joe        1    1
Steve      1    1
Wes        1    0
Jim        1    1
Travis     1    1
[5 rows x 2 columns]

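(6) Grouping by a function

Any function used as a group key is called once for each index value, and its return values are used as the group names.

The argument passed to the function can only be an index label (from the row index or the column index).

peo.groupby(len).count()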
   aa  b  ccc  d  ee
3   3  2    2  3   3
5   1  1    1  1   1
6   1  1    1  1   1
[3 rows x 5 columns]

Whatever object is used as a group key is eventually converted into an array!

key_list = ['one','one','one','two','two']
peo.groupby([len,key_list]).min()
             aa         b       ccc         d        ee
3 one   -0.253221  -0.040501 -0.903721 -1.554133 -0.951133
  two   -1.427493   0.885805  2.360634 -1.522006 -0.215945
5 one    0.200526  -0.747329  1.068081 -0.277024  0.086557
6 two    0.190327   0.722256 -0.870716  0.101749  0.555358
[4 rows x 5 columns]

(7) The most practical grouping method

bins = [-2.855,-0.685,-0.00124,0.639, 2.804]
factors = pd.cut(frame['data1'],bins)
grp_cut = frame.groupby(factors)['data1']

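5.2 Data aggregation

Column-wise and multiple function application

Scenario: the data set is already grouped, and you want to apply different functions to the same column, or different functions to different columns.

Multiple function application starts with agg():
① A custom function used on a GroupBy object must be passed through agg();
② To apply several functions at once, even built-in ones, they must be passed to agg();
③ It is best to pass the set of functions to agg() as a list.

(1) Approach

① Create the groupby object;
② Create the list of functions;
③ Compute the statistics with agg() and assign the result to a new variable;
④ Use the new variable to present the results nicely.

(2) Basic usage

functions =['mean','count','max']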
df.groupby('key1').agg(functions)
                   data1                           data2                 
          mean     count       max      mean       count       max
key1                                                      
a     1.127071      3       2.205815  -0.407685      3      0.997183
b     0.512487      2       0.956563   0.279608      2      0.730430
[2 rows x 6 columns]

(3) Renaming the results, similar to "AS" in a database.

tip['tip_pct'] = tip['tip'] / tip['total_bill']
# Pass a list of (name, function) tuples, similar to 'AS'.
group_pct = tip.groupby(['sex','smoker'])[['tip_pct','total_bill']]
group_pct.agg([('foo','mean'),('bar','std')])
                       tip_pct            total_bill          
                  foo       bar         foo       bar
sex    smoker                                          
Female No      0.156921  0.036421   18.105185  7.286455
       Yes     0.182150  0.071595   17.977879  9.189751
Male   No      0.160669  0.041849   19.791237  8.726566
       Yes     0.152771  0.090588   22.284500  9.911845

(4) Same columns, multiple functions

group_pct.agg(functions)
                     tip_pct                   total_bill              
                   mean      count       max        mean     count    max
sex    smoker                                                     
Female No        0.156921     54      0.252672   18.105185     54    35.83
       Yes       0.182150     33      0.416667   17.977879     33    44.30
Male   No        0.160669     97      0.291990   19.791237     97    48.33
       Yes       0.152771     60      0.710345   22.284500     60    50.81
[4 rows x 6 columns]

(5) Different columns, different functions

gro = tip.groupby(['sex','smoker'])
gro.agg({'tip' : ['min','max',np.sum,'std'],'total_bill' : 'sum'})
                   total_bill          tip                        
                      sum   min   max     sum       std
sex    smoker                                          
Female No          977.68  1.00   5.2   149.77   1.128425
       Yes         593.27  1.00   6.5    96.74   1.219916
Male   No         1919.75  1.25   9.0   302.00   1.489559
       Yes        1337.07  1.00  10.0   183.07   1.500120
[4 rows x 5 columns]

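Returning aggregated data without a group index

So far, every aggregated result has had an index made up of the unique group keys.

What we often want instead is for the group keys to remain columns, with integers as the row index.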

tip.groupby(['sex','smoker'],as_index = False).mean()
      sex   smoker  total_bill       tip      size    tip_pct
0  Female     No    18.105185     2.773519  2.592593  0.156921
1  Female    Yes    17.977879     2.931515  2.242424  0.182150
2    Male     No    19.791237     3.113402  2.711340  0.160669
3    Male    Yes    22.284500     3.051167  2.500000  0.152771
[4 rows x 7 columns]

"as_index = False" is only relevant for DataFrame input.

"as_index = False" is effectively 'SQL-style' grouped output.

This setting makes the returned result look more like a table.

5.3 Group-wise operations and transformations

How can a function be applied to each group so that the results end up in the right positions?

key_list = ['one','one','one','two','two']
grot1 = peo.groupby(key_list).transform(np.mean)
              aa         b       ccc         d        ee
Jim       -0.618583  0.804030  0.744959 -0.710128  0.169707
Joe        0.336247 -0.393915  0.082180 -0.426313 -0.444106
Steve      0.336247 -0.393915  0.082180 -0.426313 -0.444106
Travis    -0.618583  0.804030  0.744959 -0.710128  0.169707
Wes        0.336247 -0.393915  0.082180 -0.426313 -0.444106
[5 rows x 5 columns]

I have not yet found a use for this. -_-
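
One common use of transform() is to center or standardize values within each group, since its result keeps the shape and index of the input. A minimal sketch with made-up data (the frame `sales` and its columns are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({'region': ['east', 'east', 'west', 'west'],
                      'amount': [10.0, 30.0, 100.0, 300.0]})

# transform broadcasts each group's mean back onto that group's rows.
group_mean = sales.groupby('region')['amount'].transform('mean')

# Subtracting it gives each row's deviation from its own group mean.
sales['demeaned'] = sales['amount'] - group_mean

print(sales)
```

Because `group_mean` has the same index as `sales`, it can be used directly in column arithmetic, which an aggregated result (one row per group) could not.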

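apply: general split-apply-combine

(1) Preface

agg() and transform(), described above, both come with strict conditions: the function passed in may only return a scalar that can be broadcast, or a result array of the same size as the group.

(2) The appeal of apply

How much you get out of apply() depends entirely on your creativity.

Which function to pass in is entirely up to you, as long as it returns a pandas object or a scalar.

(3) Suppressing the group keys for a tidier result

Sometimes the group keys, together with the row index of the original object, form a hierarchical index in the result.

Suppressing them only makes a difference when the function returns an indexed pandas object, such as a subset of rows. For example:

# top() only sorts and slices; it performs no aggregation.
# Passing top() to agg() or transform() raises an error!
def top(df,n = 5, column = 'tip_pct'):
    return df.sort_values(by=column)[-n:]    
tip.groupby(['smoker']).apply(top)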
            total_bill   tip     sex smoker   day    time    size   tip_pct
smoker                                                                   
No     88        24.71  5.85    Male     No  Thur    Lunch      2   0.236746
       185       20.69  5.00    Male     No   Sun    Dinner     5   0.241663
       51        10.29  2.60  Female     No   Sun    Dinner     2   0.252672
       149        7.51  2.00    Male     No  Thur    Lunch      2   0.266312
       232       11.61  3.39    Male     No   Sat    Dinner     2   0.291990
Yes    109       14.31  4.00  Female    Yes   Sat    Dinner     2   0.279525
       183       23.17  6.50    Male    Yes   Sun    Dinner     4   0.280535
       67         3.07  1.00  Female    Yes   Sat    Dinner     1   0.325733
       178        9.60  4.00  Female    Yes   Sun    Dinner     2   0.416667
       172        7.25  5.15    Male    Yes   Sun    Dinner     2   0.710345
[10 rows x 8 columns]
# The result with the group keys suppressed
tip.groupby('smoker',group_keys = False).apply(top)
       total_bill   tip     sex   smoker   day     time    size   tip_pct
88        24.71    5.85    Male     No    Thur     Lunch     2    0.236746
185       20.69    5.00    Male     No    Sun     Dinner     5    0.241663
51        10.29    2.60  Female     No    Sun     Dinner     2    0.252672
149        7.51    2.00    Male     No    Thur     Lunch     2    0.266312
232       11.61    3.39    Male     No    Sat     Dinner     2    0.291990
109       14.31    4.00  Female    Yes    Sat     Dinner     2    0.279525
183       23.17    6.50    Male    Yes    Sun     Dinner     4    0.280535
67         3.07    1.00  Female    Yes    Sat     Dinner     1    0.325733
178        9.60    4.00  Female    Yes    Sun     Dinner     2    0.416667
172        7.25    5.15    Male    Yes    Sun     Dinner     2    0.710345
[10 rows x 8 columns]

(4) The very useful bucket (quantile) analysis

np.random.seed(313)
tera = pd.DataFrame({'data1': np.random.randn(1000),
                     'data2': np.random.randn(1000)})
factor_cut = pd.cut(tera['data1'],4)
factor_qcut = pd.qcut(tera['data1'],10,labels= False)
grp_cut = tera.groupby(factor_cut)['data2']
grp_qcut = tera.groupby(factor_qcut)['data2']
grp_cut.agg(['mean','count','min','max'])
                      mean       count       min       max
data1                                                
(-3.377, -1.856]     0.037777     38     -1.807103  1.685104
(-1.856, -0.342]    -0.025712    334     -2.522064  2.932612
(-0.342, 1.173]     -0.084292    486     -3.034921  2.480206
(1.173, 2.687]      -0.002127    142     -3.354899  2.150777
[4 rows x 4 columns]
grp_qcut.agg(['mean','count','min','max'])
            mean    count       min       max
0       -0.067625    100    -2.286255  1.685104
1       -0.002776    100    -2.203194  2.153888
2       -0.087955    100    -2.522064  2.614345
3        0.076329    100    -2.336891  2.932612
4       -0.095997    100    -1.676822  1.920706
5       -0.067538    100    -2.365760  2.402073
6        0.002284    100    -2.558904  2.265973
7       -0.267820    100    -3.034921  2.480206
8       -0.006690    100    -3.354899  2.461943
9        0.033588    100    -3.094111  2.150777
[10 rows x 4 columns]

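Why it is so useful:
For int-typed (numeric) fields, the bins can be set from business experience and the data summarized group by group.

# How to use it: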
bins = [-2.855,-0.685,-0.00124,0.639, 2.804]
factors = pd.cut(frame['data1'],bins)
grp_cut = frame.groupby(factors)['data1']
grp_cut.describe().unstack()

5.4 A classic example

Filling missing values with group-specific values

np.random.seed(528)
yan = np.random.randn(8)
yan[::2] = np.nan   # the number after '::' is the step: every k-th element is taken
sta= np.array(['Ohio','California','New York','Vermont','Florida','Oregon','Nevada','Idaho'])
ya = pd.Series(yan,index=sta)
# A quick way to build the group keys with a list.
gr_keys = ['East'] * 4 + ['West'] * 4       
# A quick way to build the group keys with an array.
r_keys_r = np.array(['East','West']).repeat([4,4])
ryan = ya.groupby(gr_keys)
fill_mean = lambda g: g.fillna(g.mean())      # 'g' is the parameter; each call receives one group's Series
fill_value = {'East': 0.931,'West': 0.3415}   # predefined fill value for each group
fill_func = lambda g: g.fillna(fill_value[g.name])
ryan.apply(fill_mean)
Ohio         -0.138512
California   -0.766740
New York     -0.138512
Vermont       0.489717
Florida      -0.554765
Oregon       -0.613530
Nevada       -0.554765
Idaho        -0.496000
dtype: float64
ryan.apply(fill_func)
Ohio          0.931000
California   -0.766740
New York      0.931000
Vermont       0.489717
Florida       0.341500
Oregon       -0.613530
Nevada        0.341500
Idaho        -0.496000
dtype: float64

5.5 Pivot tables and cross-tabulations

Pivot tables

A pivot table aggregates data by one or more keys,

and arranges the results in a rectangle according to the group keys along the rows and the columns.

tip.pivot_table(values = ['tip_pct','size'],index = ['sex','day'],columns = 'smoker',margins = True)
                         tip_pct                     size                    
smoker             No       Yes       All        No       Yes       All
sex    day                                                             
Female Fri      0.165296  0.209129  0.199388  2.500000  2.000000  2.111111
       Sat      0.147993  0.163817  0.156470  2.307692  2.200000  2.250000
       Sun      0.165710  0.237075  0.181569  3.071429  2.500000  2.944444
       Thur     0.155971  0.163073  0.157525  2.480000  2.428571  2.468750
Male   Fri      0.138005  0.144730  0.143385  2.000000  2.125000  2.100000
       Sat      0.162132  0.139067  0.151577  2.656250  2.629630  2.644068
       Sun      0.158291  0.173964  0.162344  2.883721  2.600000  2.810345
       Thur     0.165706  0.164417  0.165276  2.500000  2.300000  2.433333
All             0.159328  0.163196  0.160803  2.668874  2.408602  2.569672
[9 rows x 6 columns]

pivot_table(data, values=None, index=None, columns=None,aggfunc='mean'
            ,fill_value=None, margins=False, dropna=True)
Parameters
----------
data   : DataFrame
values : name(s) of the column(s) to aggregate; by default all numeric columns are aggregated.
index  : column names or other group keys to group on; they appear on the rows of the resulting pivot table. (Called 'rows' in older pandas.)
columns: column names or other group keys to group on; they appear on the columns of the resulting pivot table. (Called 'cols' in older pandas.)
aggfunc: aggregation function or list of functions, 'mean' by default. Any function valid for GroupBy can be used.
margins: add row and column subtotals and a grand total; False by default.

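Cross-tabulations

A special kind of pivot table that computes group frequencies.

The objects passed in should preferably be factors (categorical variables).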

pd.crosstab(tip['smoker'],tip['sex'],margins = True)
sex     Female  Male  All
smoker                   
No          54    97  151
Yes         33    60   93
All         87   157  244
[3 rows x 3 columns]

pd.crosstab(index, columns, values=None, rownames=None, colnames=None
            ,aggfunc=None, margins=False, dropna=True)
Parameters
----------
index:     array-like, Series, or list of arrays/Series
           Values to group by in the rows
columns:   array-like, Series, or list of arrays/Series
           Values to group by in the columns
values:    array-like, optional
           Array of values to aggregate according to the factors
aggfunc:   function, optional
           If no values array is passed, computes a frequency table
rownames:  sequence, default None
           If passed, must match number of row arrays passed
colnames:  sequence, default None
           If passed, must match number of column arrays passed
margins:   boolean, default False
           Add row/column margins (subtotals)
dropna:    boolean, default True
           Do not include columns whose entries are all NaN

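6. Advanced NumPy

6.1 ndarray object internals

An ndarray provides a way to interpret a block of homogeneous data as a multidimensional array object; the data type (dtype) determines how the data is interpreted.

Internally, an ndarray consists of:

a pointer to the data (a block of memory);

the data type, dtype;

a tuple giving the array's shape;

a tuple of strides. (Strides are what allow array views to be created without copying any data. Use arr[5:8].copy() to force a copy.)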
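
These pieces can be inspected directly on an array. The sketch below is illustrative only; it fixes the dtype to int64 so that the stride values are the same on every platform:

```python
import numpy as np

arr = np.arange(12, dtype=np.int64).reshape(3, 4)

print(arr.shape)     # (3, 4)
print(arr.dtype)     # int64
print(arr.strides)   # (32, 8): 32 bytes to the next row, 8 bytes to the next column

# A transpose is a view: same memory, with the shape and strides swapped.
print(arr.T.strides)                  # (8, 32)
print(np.shares_memory(arr, arr.T))   # True
```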

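6.2 Advanced array manipulation

Reshaping

arr = np.arange(18)
arr.reshape(6,-1)  # a dimension given as '-1' has its size inferred from the data

Flattening

Meaning: converting an array of any number of dimensions into a one-dimensional array.

The resulting shape is always (n,), that is, a 1-D array of n elements (not a 2-D "1 x n" array).

ker = np.arange(6).reshape(3,2)
ker.ravel()         # returns a view where possible (no copy of the source data)
ker.flatten()       # always returns a copy of the source data

C and Fortran order

NumPy allows flexible control over how data is laid out in memory.

By default, NumPy arrays are created in row-major order, which means that for a two-dimensional array the items in each row are stored in adjacent memory locations.

The key difference between C and Fortran order is the order in which the dimensions are traversed:

C (row-major order): the higher dimensions are traversed first (axis 1 advances before axis 0).

Fortran (column-major order): the higher dimensions are traversed last (axis 0 advances before axis 1).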
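
The two traversal orders are easy to compare by flattening the same small array both ways. A minimal sketch (the array `mat` is made up for illustration):

```python
import numpy as np

mat = np.arange(6).reshape(2, 3)   # [[0 1 2]
                                   #  [3 4 5]]

c_flat = mat.ravel(order='C')   # row by row: the last axis varies fastest
f_flat = mat.ravel(order='F')   # column by column: the first axis varies fastest

print(c_flat)   # [0 1 2 3 4 5]
print(f_flat)   # [0 3 1 4 2 5]
```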

Concatenating and splitting arrays

# Only the easiest-to-remember concatenation helpers are shown here:
terafill = np.arange(6)
tera = terafill.reshape(3,2)
fill = np.random.randn(3,2)
beats = np.r_[tera,fill]                        # like rbind() in R
urbeats = np.c_[np.r_[tera,fill],terafill]      # like cbind() in R

Repeating array elements

When repeats is a list, len(list) must equal the number of elements in the array.

When an "axis" argument is also passed, "number of elements" means the length along the specified axis.

(1) repeat()

np.repeat(array, repeats, axis=None)
Parameters: 
-----------
array   : array_like
          Input array.
repeats : {int, array of ints}
          The number of repetitions for each element. 
          repeats is broadcasted to fit the shape of the given axis.
axis    : int, optional
          The axis along which to repeat values. 
          By default, use the flattened input array, and return a flattened output array.

(2) tile()

np.tile(array,reps)
Parameters: 
-----------
array   : the array
reps    : when a scalar is passed, the array is tiled horizontally only;
          when a tuple is passed, the first number is the count of vertical repetitions and the second the count of horizontal repetitions.
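
A small sketch contrasting the two functions on the same made-up input (`row` and the result names are illustrative):

```python
import numpy as np

row = np.array([1, 2, 3])

each_twice = np.repeat(row, 2)            # every element repeated in place: [1 1 2 2 3 3]
per_element = np.repeat(row, [1, 0, 2])   # one count per element: [1 3 3]
tiled_flat = np.tile(row, 2)              # the whole array laid end to end: [1 2 3 1 2 3]
tiled_grid = np.tile(row, (2, 2))         # 2 copies down, 2 copies across: shape (2, 6)

print(tiled_grid)
```

`repeat` duplicates individual elements, while `tile` duplicates the array as a block.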

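6.3 Broadcasting

Definition:

Broadcasting describes how arithmetic is carried out between arrays of different shapes;

it is a powerful tool for vectorized computation.

Rule (two arrays are compatible when, for each trailing dimension, one of the following holds):

the axis lengths of the trailing dimensions (counted from the end) match; or

one of the two arrays has length "1" in that dimension.

# The axis lengths of the trailing dimensions match:
hol = np.random.randn(4,3)
hol - hol.mean(0)
# One of the two arrays has length "1" in the trailing dimension:
hol - hol.mean(1).reshape(4,1)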
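
The shapes involved can be made explicit. The sketch below repeats the two cases with a deterministic array and uses `keepdims=True` as an alternative to the manual reshape; the variable names are illustrative:

```python
import numpy as np

hol = np.arange(12, dtype=float).reshape(4, 3)

col_means = hol.mean(axis=0)                  # shape (3,): matches the last axis of hol
centered_cols = hol - col_means               # (4, 3) - (3,) broadcasts down the rows

row_means = hol.mean(axis=1, keepdims=True)   # shape (4, 1): length 1 on the last axis
centered_rows = hol - row_means               # (4, 3) - (4, 1) broadcasts across the columns

print(col_means.shape, row_means.shape)       # (3,) (4, 1)
```

After centering, every column of `centered_cols` and every row of `centered_rows` has mean zero.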

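6.4 More about sorting

Direct sorting

svi = np.random.randn(7)
svi.sort()           # in-place sort: modifies the original array and creates no new array
np.sort(svi)         # creates a sorted copy of the original array

Indirect sorting

Given one or more keys, obtain an array of integer indices (an "indexer");

the indexer's values say which element of the original data goes to each position of the new order.

(1) argsort()

valu = np.array([4,1,6,0,5])
indexer = valu.argsort()    # Return the indices that would sort an array.

(2) lexsort()

# Its distinguishing feature is that it sorts indirectly on several keys at once.
fir = np.array(['Bob','Jane','Steve'])
sec = np.array(['Arnold','Jones','Walters'])
sorter = np.lexsort((fir,sec))   # the keys are passed as one tuple; the last key is the primary sort key
list(zip(sec[sorter],fir[sorter]))
# zip() is a Python built-in that packs the corresponding elements into tuples; in Python 3 it returns an iterator, so list() is used to show the tuples.

Stable sorting

A stable sort keeps elements with equal keys in their original relative order.

trum = np.array(['2:first','2:second','1:first','1:second','1:third'])
kst = np.array([2,2,1,1,1])
ind = kst.argsort(kind = 'mergesort')
trum.take(ind)

np.searchsorted()

It is an array method that performs a binary search on a sorted array; inserting a value at the position it returns keeps the array sorted.

Its real appeal lies in neatly splitting the original data into bins.

ipho = np.floor(np.random.uniform(0,1000,size = 50))
bins = np.array([0,100,1000,5000,10000])
labe = np.digitize(ipho,bins)           # bin number of each data point (np.digitize is built on searchsorted)
chin = pd.Series(ipho).groupby(labe)    # use groupby to split the original data by bin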
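
For completeness, a sketch of searchsorted itself, first for keeping an array sorted and then for labelling data by bin. The arrays are made up for illustration:

```python
import numpy as np

sorted_vals = np.array([0, 1, 7, 12, 15])

pos = sorted_vals.searchsorted(9)           # index at which 9 would have to be inserted
inserted = np.insert(sorted_vals, pos, 9)   # the array is still sorted after insertion

# With the bin edges as the sorted array, the same call labels each data point.
edges = np.array([0, 100, 1000, 5000, 10000])
points = np.array([50, 100, 250, 7000])
labels = edges.searchsorted(points)         # label k means edges[k-1] < point <= edges[k]

print(pos)        # 3
print(inserted)   # [ 0  1  7  9 12 15]
print(labels)     # [1 1 2 4]
```

With the default `side='left'`, a point equal to an edge is assigned to the bin ending at that edge, which differs from the default behaviour of `np.digitize` at the boundaries.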