@jk88876594 2017-07-01T11:45:15.000000Z

Introducing and Using Series

A Lei's Learn-as-You-Teach Python Data Analysis, Part 3: pandas and numpy

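1. What is a Series

In short: a one-dimensional array with labels (an index).
In full: a one-dimensional array object made up of a set of values and a set of data labels (the index) associated with them.

Index	Value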
a	1
b	2
c	4
d	8

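2. Characteristics of a Series

(1) The values can be of any type: integers, floats, strings, or arbitrary Python objects such as lists and dictionaries.
(2) All values in one Series share a single dtype. If you mix types, pandas falls back to the generic object dtype, so in practice you should keep the values in one Series to the same type.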
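
The two points above can look contradictory, so here is a minimal sketch of how they fit together. The variable names (ints, floats, mixed) are made up for illustration and are not from the original post.

```python
import pandas as pd

# Homogeneous input: pandas picks one concrete dtype for every element.
ints = pd.Series([1, 2, 4, 8])
floats = pd.Series([1, 2.5, 4])   # the integers are upcast to float

# Mixed Python objects are allowed, but the Series then uses the
# generic object dtype instead of a numeric one.
mixed = pd.Series([1, "two", [3, 4], {"five": 5}])

print(ints.dtype)
print(floats.dtype)
print(mixed.dtype)
```

Numeric methods such as sum() or mean() are only meaningful on the first two; an object Series mostly behaves like a labeled Python list.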

3. Creating a Series

General form
s = pd.Series(data, index=index)

(1) From a list

import pandas as pd
import numpy as np
s = pd.Series([10,30,20,40])
s

0    10
1    30
2    20
3    40
dtype: int64

(2) From a dictionary

dict_1 = {"a":10,"c":5,"b":40}
s1 = pd.Series(dict_1)
s1

a    10
b    40
c     5
dtype: int64

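A side note on dictionaries versus Series: when you build a Series from a dictionary, the keys become the index. The output above, with the keys sorted in ascending order, is what the pandas version used for this post produced. From pandas 0.23 on Python 3.6 or later, the Series keeps the dictionary's insertion order (a, c, b here) instead; call s1.sort_index() if you want the sorted order.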
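
Because the resulting order depends on the pandas version, a small sketch of how to make the order explicit may help. The names s1_sorted and s1_picked are illustrative only.

```python
import pandas as pd

dict_1 = {"a": 10, "c": 5, "b": 40}
s1 = pd.Series(dict_1)

# Sort the labels explicitly instead of relying on the constructor.
s1_sorted = s1.sort_index()

# Or pass index= to choose both the labels and their order.
# A label that is missing from the dictionary ("z") becomes NaN.
s1_picked = pd.Series(dict_1, index=["b", "a", "z"])

print(s1_sorted)
print(s1_picked)
```

Passing index= together with a dictionary selects by key rather than relabeling, which is why "z" has no value.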

(3) From a NumPy array

array_1 = np.arange(10,16)
s2 = pd.Series(array_1,index=list("abcdef"))
s2

a    10
b    11
c    12
d    13
e    14
f    15
dtype: int32

4. Series attributes

(1) Getting the index

s2.index

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

(2) Replacing the whole index by assignment

s2.index = ["aa","bb","cc","dd","eee","fff"]
s2

aa     10
bb     11
cc     12
dd     13
eee    14
fff    15
dtype: int32

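Key point: the new index must have the same number of labels as the old one.
Also worth noting: you cannot assign to a single index label in place, because an Index is immutable. To change labels this way, you have to set the whole index again.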
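
As a rough illustration of that restriction, the sketch below tries to overwrite one label and then shows one way around it with Series.rename, which returns a relabeled copy. The small Series s is made up for this example.

```python
import pandas as pd

s = pd.Series([10, 11, 12], index=["a", "b", "c"])

# An Index is immutable, so item assignment raises TypeError.
error_message = None
try:
    s.index[0] = "aa"
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# rename() with a mapping changes only the listed labels and
# returns a new Series; s itself keeps its old index.
renamed = s.rename({"a": "aa"})
print(renamed)
```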

(3) Setting the name of the index

s2.index.name = "banana"
s2

banana
aa     10
bb     11
cc     12
dd     13
eee    14
fff    15
dtype: int32

(4) Setting the name of the Series

s2.name = "length"
s2

banana
aa     10
bb     11
cc     12
dd     13
eee    14
fff    15
Name: length, dtype: int32

(5) Getting the values of the Series

s2.values

array([10, 11, 12, 13, 14, 15])

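As the above shows, both the Series itself and its index have a name attribute.

5. Indexing a Series

(1) Indexing by position

# Get the value in the first row
# (newer pandas releases warn about using plain brackets with positions on a labeled Series; s2.iloc[0] is the explicit spelling)
s2[0]

# Get the value in the last row
s2[-1]

# Get the values in specific rows (here rows 1, 4 and 6)
s2[[0,3,5]]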

banana
aa     10
dd     13
fff    15
Name: length, dtype: int32

(2) Indexing by label

# Get the value whose label is aa
s2["aa"]

# Get the values for several labels
s2[["aa","cc","fff"]]

banana
aa     10
cc     12
fff    15
Name: length, dtype: int32
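
Plain brackets leave it to pandas to decide whether a key is a position or a label. The sketch below repeats the lookups from (1) and (2) with the explicit accessors .iloc (position) and .loc (label). The Series s is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

s = pd.Series([10, 11, 12, 13, 14, 15],
              index=["aa", "bb", "cc", "dd", "eee", "fff"])

# Position-based access
first = s.iloc[0]
last = s.iloc[-1]
some_rows = s.iloc[[0, 3, 5]]

# Label-based access
by_label = s.loc["aa"]
some_labels = s.loc[["aa", "cc", "fff"]]

print(first, last, by_label)
print(some_rows)
print(some_labels)
```

The explicit accessors matter most when the index itself consists of integers, where s[0] would be read as the label 0 rather than the first row.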

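(3) Attribute (dot) access

Returns only one value at a time.

Dot access looks the label up as an attribute of the Series object. A label that is a Python keyword, or that has the same name as an existing Series attribute or method, therefore cannot be reached this way, so avoid dot access whenever a label might clash.

When the label does not clash

s2.aa

When the label clashes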

s2.index = ["aa","bb","cc","dd","eee","def"]

s2.def

  File "<ipython-input-71-e31e21dd2a96>", line 2
    s2.def
         ^
SyntaxError: invalid syntax

# Positional and label indexing still work, however
print(s2[5])
print(s2["def"])

15
15

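1. In general, label indexing or positional indexing is still the recommended choice.
2. If you are sure no name clash can occur, dot access works and is slightly quicker to write, but I don't see the point of memorizing so many methods. So I recommend label and positional indexing.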

6. Slicing a Series

(1) Slicing by position

s2[1:4]

banana
bb    11
cc    12
dd    13
Name: length, dtype: int32

The end position is not included.

(2) Slicing by label

s2["aa":"eee"]

banana
aa     10
bb     11
cc     12
dd     13
eee    14
Name: length, dtype: int32

The end label is included.

7. Modifying values in a Series

s2[index] = value  (index is the label of the value to change)

s2[i] = value  (i is the position of the value to change)
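
The endpoint difference is easy to trip over, so here is a small sketch that selects the same three rows both ways, using the explicit .iloc and .loc accessors. The Series s is rebuilt so the snippet is self-contained.

```python
import pandas as pd

s = pd.Series([10, 11, 12, 13, 14, 15],
              index=["aa", "bb", "cc", "dd", "eee", "fff"])

# Position slice: the stop position 4 is excluded, like a Python list.
by_position = s.iloc[1:4]

# Label slice: the stop label "dd" is included.
by_label = s.loc["bb":"dd"]

print(by_position)
print(by_label)
```

Both slices return bb, cc and dd. The label slice includes its end because there is no general way to name "the label just after dd".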

s2["aa"] = 100
s2[2] = 120
s2

banana
aa     100
bb      11
cc     120
dd      13
eee     14
def     15
Name: length, dtype: int32

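8. Adding values to a Series

Return a new Series without modifying the original
s2.append(pd.Series([value1,value2,...],index = [index1,index2,...]))

Modify the original Series in place
s2["new index"] = value

# Add values and return a new Series
# (Series.append was removed in pandas 2.0; there, use pd.concat([s2, other]) instead)
s2.append(pd.Series([50,60],index=["a1","a2"]))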

aa     100
bb      11
cc     120
dd      13
eee     14
def     15
a1      50
a2      60
dtype: int64

# Add a value by modifying the original Series in place
s2["y"] = 99
s2

banana
aa     100
bb      11
cc     120
dd      13
eee     14
def     15
y       99
Name: length, dtype: int64

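Adding values with append:
1. Returns a new Series
2. Can add several values at once

Adding a value with s2["new index"] = value:
1. Adds the value directly to the original Series
2. Adds only one value at a time

9. Deleting values from a Series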
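
Series.append no longer exists in pandas 2.0 and later. A minimal sketch of the same "return a new Series" behaviour with pd.concat follows. The names s, extra and combined are illustrative.

```python
import pandas as pd

s = pd.Series([100, 11, 120], index=["aa", "bb", "cc"])
extra = pd.Series([50, 60], index=["a1", "a2"])

# pd.concat takes a list of Series and returns a new one;
# neither input is modified.
combined = pd.concat([s, extra])

print(combined)
print(len(s))   # the original still has 3 values
```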

del s2[index]

# Delete the value 99 stored under label y
del s2["y"]
s2

banana
aa     100
bb      11
cc     120
dd      13
eee     14
def     15
Name: length, dtype: int64
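
del changes the Series in place. If you would rather keep the original, Series.drop is one alternative that returns a new Series. The sketch below uses a small made-up Series.

```python
import pandas as pd

s = pd.Series([100, 11, 99], index=["aa", "bb", "y"])

# drop() accepts one label or a list of labels and returns a copy
# without them; s is left untouched.
without_y = s.drop("y")
without_two = s.drop(["aa", "y"])

print(without_y)
print(without_two)
print(s)
```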

10. Filtering a Series

Use a boolean mask (a condition) to filter values and keep only those that satisfy it
s2[s2 < value]
s2[s2 > value]
s2[s2 == value]
s2[s2 != value]

# Single condition
s2[s2 > 90]

banana
aa    100
cc    120
Name: length, dtype: int64

s2[s2 == 13]

banana
dd    13
Name: length, dtype: int64

# Multiple conditions (| means OR; wrap each condition in parentheses)
s2[(s2 > 50) | (s2 < 14)]

banana
aa    100
bb     11
cc    120
dd     13
Name: length, dtype: int64
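
The example above combines conditions with | (OR). As a complement, this sketch shows & (AND), ~ (NOT) and isin() on a Series with the same values. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([100, 11, 120, 13, 14, 15],
              index=["aa", "bb", "cc", "dd", "eee", "def"])

# AND: both conditions must hold. The parentheses are required
# because & binds more tightly than the comparisons.
both = s[(s > 12) & (s < 100)]

# NOT: invert a mask with ~.
negated = s[~(s > 12)]

# Membership test against a list of values.
members = s[s.isin([11, 120])]

print(both)
print(negated)
print(members)
```

Python's own and / or keywords do not work here, because they try to reduce the whole boolean Series to a single True or False.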

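11. Handling missing values in a Series

# Create a Series with missing values
s = pd.Series([10,np.nan,15,19,None])
s

Note: None is treated as a missing value (NA).

(1) Checking for missing values
isnull()

# Flag the missing values in s
s.isnull()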

0    False
1     True
2    False
3    False
4     True
dtype: bool

# To pull out the missing entries themselves, filter with the boolean mask
s[s.isnull()]

1   NaN
4   NaN
dtype: float64

(2) Dropping missing values
dropna()

# dropna() drops every missing value (NaN) and returns a new Series
s.dropna()

0    10.0
2    15.0
3    19.0
dtype: float64

# The original Series is unchanged
s

0    10.0
1     NaN
2    15.0
3    19.0
4     NaN
dtype: float64

# To change the original, assign the new Series returned by s.dropna() back to the same name
s = s.dropna()
s

0    10.0
2    15.0
3    19.0
dtype: float64

Filtering achieves the same result as dropping:
s[~s.isnull()]
s[s.notnull()]

s = pd.Series([10,np.nan,15,19,None]) # reset s
s[~s.isnull()]  # still returns a new Series; the tilde ~ means logical NOT

0    10.0
2    15.0
3    19.0
dtype: float64

# notnull() does the same and also returns a new Series
s[s.notnull()]

0    10.0
2    15.0
3    19.0
dtype: float64

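(3) Filling missing values
fillna()
Fills missing values either with a specified value or by propagating a neighboring value

Filling with a specified value

# Fill missing values with 0; this still returns a new Series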
s.fillna(value=0)

0    10.0
1     0.0
2    15.0
3    19.0
4     0.0
dtype: float64

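# To modify the original Series, either assign the result back as before or pass inplace=True
s.fillna(value=0,inplace=True)

Filling from neighboring values

# Reset s
s = pd.Series([10,np.nan,15,19,None])
s

Forward fill (ffill): each gap takes the last valid value before it

# (newer pandas releases deprecate the method argument; s.ffill() is the equivalent)
s.fillna(method="ffill")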

0    10.0
1    10.0
2    15.0
3    19.0
4    19.0
dtype: float64

Backward fill (bfill): each gap takes the next valid value after it; the trailing gap stays NaN because nothing follows it

s.fillna(method="bfill")

0    10.0
1    15.0
2    15.0
3    19.0
4     NaN
dtype: float64
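
Recent pandas versions prefer the dedicated ffill() and bfill() methods over fillna(method=...). The sketch below repeats both fills that way and adds interpolate() for comparison, which is true interpolation rather than copying a neighbor. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, 15, 19, None])

forward = s.ffill()    # copy the previous valid value forward
backward = s.bfill()   # copy the next valid value backward

# Linear interpolation: the gap at position 1 lies halfway
# between 10 and 15, so it becomes 12.5.
linear = s.interpolate()

print(forward)
print(backward)
print(linear)
```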

12. Sorting

# Create a Series
s3 = pd.Series([10,15,8,4,20],index=list("gadkb"))
s3

g    10
a    15
d     8
k     4
b    20
dtype: int64

(1) Sorting by index
sort_index() sorts in ascending order by default; pass ascending=False for descending order

# Sort by index, ascending
s3.sort_index()

a    15
b    20
d     8
g    10
k     4
dtype: int64

# Sort by index, descending
s3.sort_index(ascending=False)

k     4
g    10
d     8
b    20
a    15
dtype: int64

(2) Sorting by value
sort_values() sorts in ascending order by default; pass ascending=False for descending order

# Sort by value, ascending
s3.sort_values()

k     4
d     8
g    10
a    15
b    20
dtype: int64

# Sort by value, descending
s3.sort_values(ascending=False)

b    20
a    15
g    10
d     8
k     4
dtype: int64

13. Ranking

rank()

# Create a Series to rank
s4 = pd.Series([2,5,15,7,1,2])
s4

0     2
1     5
2    15
3     7
4     1
5     2
dtype: int64

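Dense ranking (often called "Chinese-style ranking"): tied values share a rank, and the next rank follows without a gap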

s4.rank(ascending=False,method="dense")

0    4.0
1    3.0
2    1.0
3    2.0
4    5.0
5    4.0
dtype: float64
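
To show what "dense" changes, this sketch ranks the same data with three of the method options that rank() accepts. Only the handling of the tied 2s and of the value after them differs.

```python
import pandas as pd

s4 = pd.Series([2, 5, 15, 7, 1, 2])

# dense: ties share a rank, no gap afterwards (..., 4, 4, 5)
dense = s4.rank(ascending=False, method="dense")

# min: ties share the lowest rank, then a gap (..., 4, 4, 6)
minimum = s4.rank(ascending=False, method="min")

# average (the default): ties get the mean of their ranks (4.5, 4.5)
average = s4.rank(ascending=False)

print(pd.DataFrame({"value": s4, "dense": dense,
                    "min": minimum, "average": average}))
```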

14. Descriptive statistics for a Series

# Create a Series
s5 = pd.Series([100,50,100,75,24,100])
s5

Counting values: Series.value_counts()

s5.value_counts()

100    3
75     1
50     1
24     1
dtype: int64

Minimum: s5.min()

s5.min()

Maximum: s5.max()

s5.max()

Median: s5.median()

s5.median()

87.5

Mean: s5.mean()

s5.mean()

74.83333333333333

Sum: s5.sum()

s5.sum()

Standard deviation (the sample standard deviation, ddof=1, by default): s5.std()

s5.std()

31.940048006643114

Summary statistics: s5.describe()

s5.describe().round(1)

count      6.0
mean      74.8
std       31.9
min       24.0
25%       56.2
50%       87.5
75%      100.0
max      100.0
dtype: float64

15. Vectorized operations on a Series

An operation applies to every element of the Series at once and returns a new Series; the original is not modified

s5 + 1000

0    1100
1    1050
2    1100
3    1075
4    1024
5    1100
dtype: int64

s5 - 2000

0   -1900
1   -1950
2   -1900
3   -1925
4   -1976
5   -1900
dtype: int64

s5 * 2

0    200
1    100
2    200
3    150
4     48
5    200
dtype: int64

s5 / 10

0    10.0
1     5.0
2    10.0
3     7.5
4     2.4
5    10.0
dtype: float64

Values with the same index label are aligned automatically; labels present in only one Series produce NaN

s6 = pd.Series([35000,40000,71000,5500],index=list("abcd"))
s7 = pd.Series([222,35000,4000,2222],index=list("aqtb"))
s6 + s7

a    35222.0
b    42222.0
c        NaN
d        NaN
q        NaN
t        NaN
dtype: float64
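
If the NaN results are not what you want, one option is the add() method with fill_value, which treats a label missing from one side as that value before adding. A minimal sketch with the same two Series (the name total is illustrative):

```python
import pandas as pd

s6 = pd.Series([35000, 40000, 71000, 5500], index=list("abcd"))
s7 = pd.Series([222, 35000, 4000, 2222], index=list("aqtb"))

# Labels found in only one Series are paired with 0 instead of NaN.
total = s6.add(s7, fill_value=0)

print(total)
```

Whether 0 is a sensible stand-in depends on the data; for sums it usually is, for averages or ratios it usually is not.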
