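Pandas is a third-party Python library that provides high-performance, easy-to-use data types and analysis tools. It is built on NumPy and is often used together with NumPy and Matplotlib.
Reference: [MOOC] Python Data Analysis and Visualization, Beijing Institute of Technology, [Week 3] Overview of Data Analysis; open course; Document; GitHub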
The Pandas library
Pandas is usually imported with import pandas as pd, where pd is the conventional alias for the module.
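Differences from NumPy:
NumPy                                     Pandas
Basic data types                          Extended data types
Focuses on the structure of data          Focuses on the application of data
Dimensions: relationships between data    Relationship between data and index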
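Building on NumPy, the library provides two new data types: Series and DataFrame.
These types support several kinds of operations: basic operations, arithmetic operations, feature (statistical) operations, and relational operations.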
Series type (one-dimensional)
A Series consists of a set of data and the index labels associated with it:
index_0 ------> data_a
index_1 ------> data_b
index_2 ------> data_c
index_3 ------> data_d
index_4 ------> data_e
index           data
A Series is a one-dimensional labeled array. It has two parts, index and values, which correspond one to one.
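Creating a Series
A Series can be created from:
- a Python list, in which case the index has as many entries as the list
- a scalar value, in which case the index determines the size of the Series
- a Python dictionary, whose keys become the index; passing index selects entries from the dictionary
- an ndarray; both the index and the data can be given as ndarrays
- other iterables, such as range()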
List
```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
print(s)
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# dtype: int64
```
Pandas also accepts a custom index.
```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s)
# a    1
# b    2
# c    3
# d    4
# e    5
# dtype: int64
```
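Scalar
```python
import pandas as pd

s = pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])
print(s)
# a    5
# b    5
# c    5
# d    5
# e    5
# dtype: int64
```
Note: the index should not be omitted here, because it determines the length of the Series. Without it, pd.Series(5) produces a Series with a single element.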
Dictionary
```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5},
              index=['e', 'd', 'c', 'b', 'a'])
print(s)
# e    5
# d    4
# c    3
# b    2
# a    1
# dtype: int64
```
Note: index may list fewer labels than the dictionary has keys. A label that is not a key of the dictionary does not raise an error; its value becomes NaN. The output follows the order of index.
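As a small illustration of how index selects from a dictionary, the sketch below keeps only two of the keys. The dictionary contents and the variable names are made up for the example.

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # hypothetical sample data

# Only the labels listed in `index` are kept, in the order they are listed.
subset = pd.Series(data, index=['d', 'b'])
print(subset)
```

The result has two entries, 'd' first and 'b' second, regardless of the key order in the dictionary.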
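ndarray
```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(9, 4, -1))
print(s)
# 9    0
# 8    1
# 7    2
# 6    3
# 5    4
# dtype: int64
```
Note: the dtype here is int64. It is inherited from the ndarray, so with NumPy versions before 2.0 on Windows, where np.arange produces 32-bit integers, it is int32 instead.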
range()
```python
import pandas as pd

s = pd.Series(range(5), index=range(9, 4, -1))
print(s)
# 9    0
# 8    1
# 7    2
# 6    3
# 5    4
# dtype: int64
```
Note: the dtype here is int64.
Basic operations on a Series
Getting all index labels and all data
```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s.index)
# Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

print(s.values)
# [1 2 3 4 5]

print(type(s.values))
# <class 'numpy.ndarray'>
```
Note that s.values returns an ndarray.
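Indexing
Automatic (positional) indexing and custom (label) indexing coexist, but the two cannot be mixed in a single lookup.
```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

s['a']      # 1, lookup by custom label
s.at['a']   # 1, scalar access by label
s.iloc[0]   # 1, lookup by position (plain s[0] is deprecated when the index holds labels)
s.iat[0]    # 1, scalar access by position

s.iloc[[1, 2, 3]]   # several positions at once
# b    2
# c    3
# d    4
# dtype: int64

s[['a', 'b', 'c']]  # several labels at once
# a    1
# b    2
# c    3
# dtype: int64

# Mixing labels and positions does not work. Old pandas versions returned
# NaN for the entry 1; current versions raise a KeyError.
# s[['a', 1]]
```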
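Because labels and positions cannot be mixed, it can help to make the intent explicit with .loc (labels) and .iloc (positions). The following is a minimal sketch with made-up variable names.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['b']        # label-based lookup
by_position = s.iloc[1]      # position-based lookup of the same element

labels_subset = s.loc[['a', 'c']]      # select by a list of labels
positions_subset = s.iloc[[0, 2]]      # select the same rows by position

print(by_label, by_position)
print(labels_subset.equals(positions_subset))
```

Both lookups address the same elements; only the way they are named differs.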
Slicing
NumPy operations can be applied to a Series (the result is still a Series).
A Series can be sliced by custom labels (the result is still a Series).
A Series can also be sliced by automatic (positional) index; the custom index is sliced along with the data (the result is still a Series).
```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

s.iloc[3]          # 4 (the original s[3] relies on deprecated positional fallback)

s.loc['b':'d']     # slice by label
# b    2
# c    3
# d    4
# dtype: int64

s.iloc[1:4]        # slice by position
# b    2
# c    3
# d    4
# dtype: int64

s[:3]
# a    1
# b    2
# c    3
# dtype: int64

s[s > s.median()]  # boolean mask
# d    4
# e    5
# dtype: int64

np.exp(s)          # NumPy functions return a Series
# a      2.718282
# b      7.389056
# c     20.085537
# d     54.598150
# e    148.413159
# dtype: float64
```
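One detail worth checking when slicing is the treatment of the end point: a label slice with .loc includes its end label, while a positional slice with .iloc excludes its end position. A minimal sketch (variable names are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

label_slice = s.loc['b':'d']     # end label 'd' is included
position_slice = s.iloc[1:3]     # end position 3 is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```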
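Dictionary-like operations
Operations familiar from Python dictionaries also work on a Series:
- access by custom label
- the in operator (it only checks the custom index, not the automatic positions)
- the .get() method
```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

'c' in s          # True
3 in s            # False: 3 is a value or position, not a label

s.get('c')        # 3
s.get('f')        # None
s.get('f', 100)   # 100
```
Notes on s.get(key):
- If key is in the index, the corresponding value is returned.
- If key is not in the index, None is returned.
- If a second argument is given and key is not in the index, the second argument is returned; s itself is not changed.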
Alignment
In arithmetic, a Series automatically aligns data by index label; labels present on only one side produce NaN.
```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([3, 4, 5], index=['c', 'd', 'e'])

s1 + s2
# a    NaN
# b    NaN
# c    6.0
# d    NaN
# e    NaN
# dtype: float64
```
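If the NaN values produced by alignment are not wanted, the method form of the operator accepts a fill_value. The sketch below treats a missing entry as 0; that choice is an assumption for the example, not a general rule.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([3, 4, 5], index=['c', 'd', 'e'])

# A label missing from one side is treated as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)
```

Only 'c' exists on both sides, so it is the only label where two real values are added.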
Names
Both the Series object and its index can have a name, stored in the .name attribute.
```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s.name = 'values'
s.index.name = 'indexes'
print(s)
# indexes
# a    1
# b    2
# c    3
# d    4
# e    5
# Name: values, dtype: int64
```
Modification
A Series can be modified at any time, and the change takes effect immediately.
```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s['c'] = 30
print(s)
# a     1
# b     2
# c    30
# d     4
# e     5
# dtype: int64
```
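DataFrame type (two-dimensional)
A DataFrame consists of a group of columns that share the same index:
                 columns (axis=1)
           index_0 ------> data_a data_f ... data_v
           index_1 ------> data_b data_g ... data_w
rows       index_2 ------> data_c data_h ... data_x
(axis=0)   index_3 ------> data_d data_i ... data_y
           index_4 ------> data_e data_j ... data_z
           index           data
A DataFrame is a tabular data type in which each column can hold a different value type (similar to an Excel sheet).
A DataFrame has both a row index and a column index.
A DataFrame is mostly used for two-dimensional data, but it can also represent higher-dimensional data.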
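To make the "each column can hold a different type" point concrete, here is a minimal sketch. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical table: one text column, one integer column, one float column.
df = pd.DataFrame(
    {'name': ['x', 'y', 'z'], 'count': [1, 2, 3], 'ratio': [0.5, 1.5, 2.5]},
    index=['a', 'b', 'c'],
)

print(df.dtypes)    # one dtype per column
print(df.index)     # row index
print(df.columns)   # column index
```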
Creating a DataFrame
A DataFrame can be created from:
- a two-dimensional ndarray
- a dictionary of one-dimensional ndarrays, lists, dictionaries, tuples, or Series
- Series objects
- another DataFrame
Two-dimensional ndarray
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])
print(df)
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9
```
Dictionary
```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']

# dictionary of ndarrays
df_dict = {'one': np.arange(0, 10, 2), 'two': np.arange(1, 10, 2)}
df = pd.DataFrame(df_dict, index=labels)

# dictionary of lists
df_dict = {'one': [0, 2, 4, 6, 8], 'two': [1, 3, 5, 7, 9]}
df = pd.DataFrame(df_dict, index=labels)

# dictionary of dictionaries (the inner keys become the row index)
df_dict = {'one': {'a': 0, 'b': 2, 'c': 4, 'd': 6, 'e': 8},
           'two': {'a': 1, 'b': 3, 'c': 5, 'd': 7, 'e': 9}}
df = pd.DataFrame(df_dict)

# dictionary of tuples
df_dict = {'one': (0, 2, 4, 6, 8), 'two': (1, 3, 5, 7, 9)}
df = pd.DataFrame(df_dict, index=labels)

# dictionary of Series (the Series index becomes the row index)
df_dict = {'one': pd.Series(range(0, 10, 2), index=labels),
           'two': pd.Series(range(1, 10, 2), index=labels)}
df = pd.DataFrame(df_dict)

# every variant above produces the same table
print(df)
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9
```
The dictionary values can be ndarrays, lists, dictionaries, tuples, or Series.
Series
```python
import pandas as pd

# Each Series becomes one row; its index supplies the column names.
s1 = pd.Series([0, 1], index=['one', 'two'])
s2 = pd.Series([2, 3], index=['one', 'two'])
s3 = pd.Series([4, 5], index=['one', 'two'])
s4 = pd.Series([6, 7], index=['one', 'two'])
s5 = pd.Series([8, 9], index=['one', 'two'])

df = pd.DataFrame([s1, s2, s3, s4, s5], index=['a', 'b', 'c', 'd', 'e'])
print(df)
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9
```
Note: the Series objects should share the same index. If they differ, pandas does not raise an error; it takes the union of the labels as columns and fills the gaps with NaN, which is rarely what is intended.
Another DataFrame
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])
print(df)
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9

# Build a new DataFrame from df, keeping only some rows and columns.
sub = pd.DataFrame(df, index=['a', 'b', 'c'], columns=['one'])
print(sub)
#    one
# a    0
# b    2
# c    4
```
This works much like slicing the original DataFrame.
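Reading a CSV file
```python
from io import StringIO

import pandas as pd

# 'aaa.csv' and 'bbb' are placeholder names for a file and one of its columns.
df = pd.read_csv('aaa.csv')                    # read a file from disk
df = pd.read_csv('aaa.csv', index_col='bbb')   # use column 'bbb' as the row index

# read CSV text held in memory
df = pd.read_csv(StringIO('one,two\n0,1\n2,3\n4,5\n6,7\n8,9'))
print(df)
#    one  two
# 0    0    1
# 1    2    3
# 2    4    5
# 3    6    7
# 4    8    9
```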
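Since the file-based calls need a file on disk, a self-contained way to try index_col is to parse CSV text from memory. This is a sketch; the column name key and the CSV content are made up.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content with a 'key' column meant to become the row index.
csv_text = 'key,one,two\na,0,1\nb,2,3\nc,4,5\n'

df = pd.read_csv(StringIO(csv_text), index_col='key')
print(df)
print(df.loc['b', 'two'])
```

With index_col set, the named column is removed from the data columns and used as row labels.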
Basic operations on a DataFrame
Getting all index labels and all data
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

print(df.index)
# Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

print(df.columns)
# Index(['one', 'two'], dtype='object')

print(df.values)
# [[0 1]
#  [2 3]
#  [4 5]
#  [6 7]
#  [8 9]]
```
Indexing
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

df[['one']]     # a list of column names returns a DataFrame
#    one
# a    0
# b    2
# c    4
# d    6
# e    8

df['one']       # a single column name returns a Series
# a    0
# b    2
# c    4
# d    6
# e    8
# Name: one, dtype: int64

df.loc['c']     # one row, by label
# one    4
# two    5
# Name: c, dtype: int64

df.iloc[1]      # one row, by position
# one    2
# two    3
# Name: b, dtype: int64

# Five ways to read the value in row 'c', column 'one' (all return 4)
df['one']['c']
df.loc['c', 'one']
df.at['c', 'one']
df.iloc[2, 0]
df.iat[2, 0]
```
Slicing
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

df[0:3]              # first three rows; df.head(3) and df.take([0, 1, 2]) give the same
#    one  two
# a    0    1
# b    2    3
# c    4    5

df.head(-3)          # everything except the last three rows
#    one  two
# a    0    1
# b    2    3

df.take([0], axis=1) # first column, by position
#    one
# a    0
# b    2
# c    4
# d    6
# e    8

df.take([-1, -2])    # rows by position, in the given order
#    one  two
# e    8    9
# d    6    7

df.tail(2)           # last two rows
#    one  two
# d    6    7
# e    8    9

df.tail(-2)          # everything except the first two rows
#    one  two
# c    4    5
# d    6    7
# e    8    9

df.loc['b':'d']      # label slice, end label included
#    one  two
# b    2    3
# c    4    5
# d    6    7

df.loc['c':, ['one']]
#    one
# c    4
# d    6
# e    8

df[df > df.median()] # boolean mask; cells that fail the test become NaN
#    one  two
# a  NaN  NaN
# b  NaN  NaN
# c  NaN  NaN
# d  6.0  7.0
# e  8.0  9.0

np.exp(df)           # NumPy functions return a DataFrame
#            one          two
# a     1.000000     2.718282
# b     7.389056    20.085537
# c    54.598150   148.413159
# d   403.428793  1096.633158
# e  2980.957987  8103.083928
```
Conditional filtering
Simple logical tests (<, >, ==, &, |, ~, and so on)
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

df.loc[df['one'] > 5]
df[df['one'] > 5]                        # same result
#    one  two
# d    6    7
# e    8    9

df['two'][df['one'] > 5]
# d    7
# e    9
# Name: two, dtype: int64

df.loc[:, ['two']][df['one'] > 5]        # columns must be selected with [:, ...]
#    two
# d    7
# e    9

df.loc[:, ['two', 'one']][df['one'] > 5]
df[['two', 'one']][df['one'] > 5]        # same result
#    two  one
# d    7    6
# e    9    8

df.loc[(df['one'] > 5) | (df['two'] < 5)]
df[(df['one'] > 5) | (df['two'] < 5)]    # same result
#    one  two
# a    0    1
# b    2    3
# d    6    7
# e    8    9

# four equivalent ways to drop the rows where two == 5
df.loc[df['two'] != 5]
df[df['two'] != 5]
df.drop(df.loc[df.two == 5].index, axis=0)
df.drop(df[df.two == 5].index, axis=0)
#    one  two
# a    0    1
# b    2    3
# d    6    7
# e    8    9
```
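The & and ~ operators listed above can be tried with the same sample data. This is a small sketch; note that each condition needs its own parentheses because & and | bind more tightly than comparisons.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

# Both conditions must hold (logical AND).
both = df[(df['one'] > 1) & (df['two'] < 9)]

# Invert a condition (logical NOT).
negated = df[~(df['one'] > 5)]

print(both)
print(negated)
```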
Filtering with a custom function
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

# .loc accepts a callable that receives the DataFrame and returns a mask
df.loc[lambda x: x['one'] * x['two'] > 5]
#    one  two
# b    2    3
# c    4    5
# d    6    7
# e    8    9

# apply with axis=1 calls the function once per row
df[df.apply(lambda x: x['one'] * x['two'] > 5, axis=1)]
#    one  two
# b    2    3
# c    4    5
# d    6    7
# e    8    9

def greater_than_five(x):
    """Element-wise test; values that cannot be compared count as False."""
    try:
        return x > 5
    except TypeError:
        return False

# The function returns a boolean row, so the mask is applied cell by cell.
df[df.apply(greater_than_five, axis=1)]
#    one  two
# a  NaN  NaN
# b  NaN  NaN
# c  NaN  NaN
# d  6.0  7.0
# e  8.0  9.0
```
Dictionary-like operations
The in operator only checks whether a value is among the column names.
Likewise, the .get() method only retrieves columns.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

'c' in df        # False: 'c' is a row label
'one' in df      # True

df.get('c')      # None
df.get('one')
# a    0
# b    2
# c    4
# d    6
# e    8
# Name: one, dtype: int64
```
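To test for a row label instead, the check can be made against df.index explicitly. A minimal sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

has_row = 'c' in df.index          # membership test on row labels
has_column = 'c' in df.columns     # membership test on column labels

# Fetch the row only if it exists, mirroring what .get() does for columns.
row = df.loc['c'] if has_row else None
print(has_row, has_column)
print(row)
```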
Alignment
```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
#    one  two
# a    0    1
# b    2    3
# c    4    5

df2 = pd.DataFrame(np.arange(4, 10).reshape(3, 2),
                   index=['c', 'd', 'e'], columns=['one', 'two'])
#    one  two
# c    4    5
# d    6    7
# e    8    9

df1 + df2   # only row 'c' exists in both operands
#    one   two
# a  NaN   NaN
# b  NaN   NaN
# c  8.0  10.0
# d  NaN   NaN
# e  NaN   NaN
```
Names
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])
df.columns.name = 'columns'
df.index.name = 'indexes'
print(df)
# columns  one  two
# indexes
# a          0    1
# b          2    3
# c          4    5
# d          6    7
# e          8    9
```
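Modification
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

s = {8, 6, 4, 2, 0}   # a set has no defined order
# The author's Python 2 session printed it as set([0, 8, 2, 4, 6]).

# Assigning a set filled the column in the set's iteration order
# (0, 8, 2, 4, 6 in the author's session), not in the order written.
# Recent pandas versions reject set input with a TypeError.
# df['one'] = {8, 6, 4, 2, 0}

df['one'] = [8, 6, 4, 2, 0]   # a list keeps its order
print(df)
#    one  two
# a    8    1
# b    6    3
# c    4    5
# d    2    7
# e    0    9

df['one'] = (8, 6, 4, 2, 0)   # so does a tuple
print(df)
#    one  two
# a    8    1
# b    6    3
# c    4    5
# d    2    7
# e    0    9
```
A set is unordered, so its elements do not arrive in the order they were written. Use a list or a tuple when assigning a column.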
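Besides replacing a whole column, a single cell can be changed and a new column can be derived from existing ones. The sketch below uses .loc for the cell; the column name three is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])

df.loc['c', 'one'] = 40                  # change one cell, addressed by labels
df['three'] = df['one'] + df['two']      # add a column computed from two others

print(df)
```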
Operating on Pandas data types
Pandas provides many functions that operate on Series and DataFrame. The two types behave similarly, so the examples below use DataFrame.
Reindexing
.reindex() changes or reorders the index of a Series or DataFrame. The example also shows the related methods set_index, reset_index, and rename.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two'])
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9

df.reindex(index=['e', 'd', 'c', 'b', 'a'])       # reorder rows
#    one  two
# e    8    9
# d    6    7
# c    4    5
# b    2    3
# a    0    1

df.reindex(columns=['two', 'one', 'three'])       # new labels are filled with NaN
#    two  one  three
# a    1    0    NaN
# b    3    2    NaN
# c    5    4    NaN
# d    7    6    NaN
# e    9    8    NaN

df.set_index('one')                               # use a column as the row index
#      two
# one
# 0      1
# 2      3
# 4      5
# 6      7
# 8      9

df.reset_index()                                  # move the index into a column
#   index  one  two
# 0     a    0    1
# 1     b    2    3
# 2     c    4    5
# 3     d    6    7
# 4     e    8    9

df.reset_index(drop=True)                         # discard the index instead
#    one  two
# 0    0    1
# 1    2    3
# 2    4    5
# 3    6    7
# 4    8    9

df.rename(columns={'one': 'x', 'two': 'y'})       # rename columns
#    x  y
# a  0  1
# b  2  3
# c  4  5
# d  6  7
# e  8  9

df.rename({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})   # rename rows with a mapping
#    one  two
# 1    0    1
# 2    2    3
# 3    4    5
# 4    6    7
# 5    8    9

df.rename(str.upper)                              # rename rows with a function
#    one  two
# A    0    1
# B    2    3
# C    4    5
# D    6    7
# E    8    9
```
Insertion
```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(10).reshape(5, 2),
                   index=['a', 'b', 'c', 'd', 'e'],
                   columns=['one', 'two'])
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9

# Insert a new column label at position 1, then reindex to materialize it.
new_columns = df1.columns.insert(1, 'three')
# Index(['one', 'three', 'two'], dtype='object')

df3 = df1.reindex(columns=new_columns, fill_value=5)
#    one  three  two
# a    0      5    1
# b    2      5    3
# c    4      5    5
# d    6      5    7
# e    8      5    9

# The same approach inserts a row label at position 3.
new_index = df1.index.insert(3, 'f')
# Index(['a', 'b', 'c', 'f', 'd', 'e'], dtype='object')

df5 = df1.reindex(index=new_index, fill_value=5)
#    one  two
# a    0    1
# b    2    3
# c    4    5
# f    5    5
# d    6    7
# e    8    9
```
Merging
See the official pandas tutorial on merging.
```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(10).reshape(5, 2),
                   index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9

df2 = pd.DataFrame(np.arange(10, 16).reshape(3, 2),
                   index=['f', 'g', 'h'], columns=['one', 'two'])
#    one  two
# f   10   11
# g   12   13
# h   14   15

df3 = pd.DataFrame(np.arange(16, 22).reshape(3, 2),
                   index=['i', 'j', 'k'], columns=['one', 'two'])
#    one  two
# i   16   17
# j   18   19
# k   20   21

df4 = pd.DataFrame(np.arange(10, 25).reshape(5, 3),
                   index=['a', 'b', 'c', 'd', 'e'],
                   columns=['three', 'four', 'five'])
#    three  four  five
# a     10    11    12
# b     13    14    15
# c     16    17    18
# d     19    20    21
# e     22    23    24

# Stack along the rows; keys adds an outer index level.
df5 = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
#      one  two
# x a    0    1
#   b    2    3
#   c    4    5
#   d    6    7
#   e    8    9
# y f   10   11
#   g   12   13
#   h   14   15
# z i   16   17
#   j   18   19
#   k   20   21

df5.loc['y']
#    one  two
# f   10   11
# g   12   13
# h   14   15

df5.loc['y'].loc['g']
# one    12
# two    13
# Name: g, dtype: int64

# Join side by side on the row index.
pd.concat([df1, df4], axis=1)
#    one  two  three  four  five
# a    0    1     10    11    12
# b    2    3     13    14    15
# c    4    5     16    17    18
# d    6    7     19    20    21
# e    8    9     22    23    24

df6 = pd.DataFrame(np.arange(10, 16).reshape(3, 2),
                   index=['d', 'e', 'f'], columns=['two', 'three'])
#    two  three
# d   10     11
# e   12     13
# f   14     15

pd.concat([df1, df6], axis=1)                  # outer join on the row index (default)
#    one  two   two  three
# a  0.0  1.0   NaN    NaN
# b  2.0  3.0   NaN    NaN
# c  4.0  5.0   NaN    NaN
# d  6.0  7.0  10.0   11.0
# e  8.0  9.0  12.0   13.0
# f  NaN  NaN  14.0   15.0

pd.concat([df1, df6], axis=1, join='inner')    # keep only the shared rows
#    one  two  two  three
# d    6    7   10     11
# e    8    9   12     13

# The original text used DataFrame.append, which was removed in pandas 2.0.
# pd.concat covers the same cases.
pd.concat([df1, df2])                 # formerly df1.append(df2): rows a..h
pd.concat([df1, df2, df3])            # formerly df1.append([df2, df3]): rows a..k

pd.concat([df1, df6])                 # formerly df1.append(df6)
#    one  two  three
# a  0.0    1    NaN
# b  2.0    3    NaN
# c  4.0    5    NaN
# d  6.0    7    NaN
# e  8.0    9    NaN
# d  NaN   10   11.0
# e  NaN   12   13.0
# f  NaN   14   15.0

pd.concat([df1, df6], ignore_index=True)   # same rows, renumbered 0..7
```
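Index
The index of a Series or DataFrame is an Index object. Index objects are immutable.
Common Index methods:
idx.append(idx)
Concatenates another Index object, producing a new Index object
idx.difference(idx)
Computes the set difference, producing a new Index object
idx.intersection(idx)
Computes the intersection, producing a new Index object
idx.union(idx)
Computes the union, producing a new Index object
idx.delete(loc)
Removes the element at position loc, producing a new Index object
idx.insert(loc, e)
Inserts element e at position loc, producing a new Index object
Series/DataFrame.drop(labels, axis=0/1)
Removes the given row or column labels from a Series or DataFrame; axis=1 means columns (the default is axis=0)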
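The methods above can be tried on two small Index objects. This is a minimal sketch with invented labels; every call returns a new Index and leaves the original untouched, and direct item assignment is refused.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
other = pd.Index(['c', 'd'])

union = idx.union(other)                  # labels in either index
intersection = idx.intersection(other)    # labels in both
difference = idx.difference(other)        # labels only in idx
inserted = idx.insert(1, 'x')             # new Index with 'x' at position 1
deleted = idx.delete(0)                   # new Index without position 0

# Index objects are immutable: item assignment raises TypeError.
try:
    idx[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

print(list(union), list(intersection), list(difference))
print(list(inserted), list(deleted), mutated)
```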
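Operations on the data types
Rules for arithmetic
Arithmetic aligns the operands on their row and column labels first and then computes. When alignment introduces missing entries, the result is floating point.
Missing entries created by alignment are filled with NaN (the null value).
Operations between two-dimensional and one-dimensional objects, or between one-dimensional and zero-dimensional objects, are broadcast (the lower-dimensional operand is applied to every element of the higher-dimensional one).
Binary operations written with +, -, *, and / produce a new object.
The method forms of these operations accept parameters, such as fill_value, that avoid the NaN values described above.
.add(d, **kwargs)
Addition between types, with optional parameters
.sub(d, **kwargs)
Subtraction between types, with optional parameters
.mul(d, **kwargs)
Multiplication between types, with optional parameters
.div(d, **kwargs)
Division between types, with optional parameters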
```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(3).reshape(3, 1),
                   index=['a', 'b', 'c'], columns=['one'])
#    one
# a    0
# b    1
# c    2

df2 = pd.DataFrame(np.arange(10).reshape(5, 2),
                   index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9

df1 + df2                      # labels missing from one side give NaN
#    one  two
# a  0.0  NaN
# b  3.0  NaN
# c  6.0  NaN
# d  NaN  NaN
# e  NaN  NaN

df1.add(df2, fill_value=5)     # a value missing on one side is replaced by 5
#     one   two
# a   0.0   6.0
# b   3.0   8.0
# c   6.0  10.0
# d  11.0  12.0
# e  13.0  14.0

s = pd.Series([5, 10], index=['one', 'two'])
# one     5
# two    10
# dtype: int64

s + 5                          # a scalar is broadcast to every element
# one    10
# two    15
# dtype: int64

df2 + s                        # the Series is matched on column labels and broadcast down the rows
#    one  two
# a    5   11
# b    7   13
# c    9   15
# d   11   17
# e   13   19

s = pd.Series([5, 10, 15], index=['a', 'b', 'c'])
# a     5
# b    10
# c    15
# dtype: int64

df2.add(s, axis=0)             # match on row labels instead
#     one   two
# a   5.0   6.0
# b  12.0  13.0
# c  19.0  20.0
# d   NaN   NaN
# e   NaN   NaN
```
Rules for comparison
Comparison operators only compare elements that carry the same labels; no alignment or filling takes place.
Operations between two-dimensional and one-dimensional objects, or between one-dimensional and zero-dimensional objects, are broadcast.
Binary operations written with >, <, >=, <=, ==, and != produce a boolean object.
```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15, 5, -1).reshape(5, 2),
                   index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])
#    one  two
# a   15   14
# b   13   12
# c   11   10
# d    9    8
# e    7    6

df2 = pd.DataFrame(np.arange(10).reshape(5, 2),
                   index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9

df1 > df2
#      one    two
# a   True   True
# b   True   True
# c   True   True
# d   True   True
# e  False  False

s = pd.Series([5, 10], index=['one', 'two'])
# one     5
# two    10
# dtype: int64

s > 5
# one    False
# two     True
# dtype: bool

s > df2                        # the Series is matched on column labels
#      one   two
# a   True  True
# b   True  True
# c   True  True
# d  False  True
# e  False  True
```
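The "same labels only" rule can be observed directly: comparing two Series whose indexes differ raises a ValueError instead of aligning. The sketch below then aligns one operand by hand with reindex, which is just one possible way to proceed.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([1, 2, 3], index=['c', 'd', 'e'])

# The operator form refuses to compare differently labeled Series.
try:
    left > right
    raised = False
except ValueError:
    raised = True

# Align explicitly first; labels missing from `right` become NaN,
# and any comparison against NaN evaluates to False.
aligned_right = right.reindex(left.index)
result = left > aligned_right

print(raised)
print(result)
```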
Single-column / multi-column / grouped / aggregate operations
```python
import numpy as np
import pandas as pd

def new_df():
    """Return a fresh copy of the sample table used by every example below."""
    return pd.DataFrame(np.arange(10).reshape(5, 2),
                        index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])
#    one  two
# a    0    1
# b    2    3
# c    4    5
# d    6    7
# e    8    9

# Transform one column with a lambda.
df = new_df()
df['one'] = df['one'].map(lambda x: x ** 2)
#    one  two
# a    0    1
# b    4    3
# c   16    5
# d   36    7
# e   64    9

# The same transformation with a named function.
def square(x):
    return x ** 2

df = new_df()
df['one'] = df['one'].map(square)   # same result as above

# Turn both columns into boolean flags.
df = new_df()
df['one'] = df['one'].map(lambda x: x >= 5)
df['two'] = df['two'].map(lambda x: x >= 5)
#      one    two
# a  False  False
# b  False  False
# c  False   True
# d   True   True
# e   True   True

# Combine several columns row by row (axis=1).
df = new_df()
df['three'] = df.apply(lambda x: x['one'] + 2 * x['two'], axis=1)
#    one  two  three
# a    0    1      2
# b    2    3      8
# c    4    5     14
# d    6    7     20
# e    8    9     26

# Combine several rows column by column (axis=0).
df = new_df()
df.loc['f', :] = df.apply(lambda x: x['a'] + 2 * x['b'], axis=0)
#    one  two
# a  0.0  1.0
# b  2.0  3.0
# c  4.0  5.0
# d  6.0  7.0
# e  8.0  9.0
# f  4.0  7.0

# Apply a function to every cell (DataFrame.map replaces applymap from pandas 2.1 on).
df = new_df()
df = df.applymap(lambda x: x ** 2 if x <= 5 else x * 2)
#    one  two
# a    0    1
# b    4    9
# c   16   25
# d   12   14
# e   16   18
```
Feature analysis with Pandas
Sorting
.sort_index(axis=0, ascending=True)
sorts along the given axis by index label, ascending by default.
Series.sort_values(axis=0, ascending=True) and DataFrame.sort_values(by, axis=0, ascending=True)
sort along the given axis by value, ascending by default.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['c', 'b', 'a', 'e', 'd'], columns=['two', 'one'])
#    two  one
# c    0    1
# b    2    3
# a    4    5
# e    6    7
# d    8    9

df.sort_index()                    # by row label, ascending
#    two  one
# a    4    5
# b    2    3
# c    0    1
# d    8    9
# e    6    7

df.sort_index(ascending=False)     # by row label, descending
#    two  one
# e    6    7
# d    8    9
# c    0    1
# b    2    3
# a    4    5

df.sort_index(axis=1)              # by column label
#    one  two
# c    1    0
# b    3    2
# a    5    4
# e    7    6
# d    9    8

df.sort_values('two', ascending=False)            # by the values in column 'two'
#    two  one
# d    8    9
# e    6    7
# a    4    5
# b    2    3
# c    0    1

df.sort_values('c', axis=1, ascending=False)      # order columns by the values in row 'c'
#    one  two
# c    1    0
# b    3    2
# a    5    4
# e    7    6
# d    9    8
```
Note: by default, NaN values are placed at the end of the result, for ascending and descending order alike.
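The placement of NaN can be checked with a short Series. The sketch also uses the na_position parameter of sort_values, which moves NaN to the front when set to 'first'.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0])

ascending = s.sort_values()                       # NaN goes last
descending = s.sort_values(ascending=False)       # NaN still goes last
nan_first = s.sort_values(na_position='first')    # NaN explicitly moved to the front

print(ascending.tolist())
print(descending.tolist())
print(nan_first.tolist())
```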
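Statistics
Basic statistics, for both Series and DataFrame:
.sum()
Sum of the data, computed along axis 0 (the same default applies to the methods below)
.count()
Number of non-NaN values
.mean() .median()
Arithmetic mean and median of the data
.var() .std()
Variance and standard deviation of the data
.min() .max()
Minimum and maximum of the data
.describe()
Summary statistics along axis 0 (one summary per column)
Basic statistics, for Series:
.argmin() .argmax()
Integer position (automatic index) of the minimum and maximum value
.idxmin() .idxmax()
Index label (custom index) of the minimum and maximum value; these two are also available on DataFrame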
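The difference between the position-based and label-based variants is easiest to see on a Series with a custom index. A minimal sketch, assuming pandas 1.0 or later (earlier versions treated Series.argmax differently):

```python
import pandas as pd

s = pd.Series([3, 9, 1], index=['a', 'b', 'c'])

max_position = s.argmax()   # integer position of the largest value
max_label = s.idxmax()      # index label of the largest value
min_position = s.argmin()   # integer position of the smallest value
min_label = s.idxmin()      # index label of the smallest value

print(max_position, max_label, min_position, min_label)
```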
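Cumulative statistics, for both Series and DataFrame:
.cumsum()
Running sum of the first 1, 2, ..., n values
.cumprod()
Running product of the first 1, 2, ..., n values
.cummax()
Running maximum of the first 1, 2, ..., n values
.cummin()
Running minimum of the first 1, 2, ..., n values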
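A short Series makes the cumulative methods easy to verify by hand. The values are arbitrary sample data.

```python
import pandas as pd

s = pd.Series([2, 1, 4, 3])

running_sum = s.cumsum()       # 2, 2+1, 2+1+4, 2+1+4+3
running_product = s.cumprod()  # 2, 2*1, 2*1*4, 2*1*4*3
running_max = s.cummax()       # largest value seen so far
running_min = s.cummin()       # smallest value seen so far

print(running_sum.tolist())
print(running_product.tolist())
print(running_max.tolist())
print(running_min.tolist())
```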
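Rolling (window) statistics, for both Series and DataFrame:
.rolling(w).sum()
Sum of each run of w adjacent elements
.rolling(w).mean()
Arithmetic mean of each run of w adjacent elements
.rolling(w).var()
Variance of each run of w adjacent elements
.rolling(w).std()
Standard deviation of each run of w adjacent elements
.rolling(w).min() .rolling(w).max()
Minimum and maximum of each run of w adjacent elements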
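In a rolling computation, each result covers the current element and the w - 1 elements before it, so the first w - 1 results are NaN by default. A minimal sketch with sample values:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

window_sum = s.rolling(2).sum()    # NaN, 1+2, 2+3, 3+4, 4+5
window_mean = s.rolling(3).mean()  # NaN, NaN, mean(1,2,3), mean(2,3,4), mean(3,4,5)

print(window_sum.tolist())
print(window_mean.tolist())
```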
Correlation analysis, for both Series and DataFrame:
.cov()
Computes the covariance matrix
.corr()
Computes the correlation matrix, using the Pearson, Spearman, or Kendall coefficient
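As a quick check of .cov() and .corr(), the sketch below uses three made-up columns: y is exactly twice x, and z decreases as x increases, so the Pearson coefficients should come out as 1 and -1.

```python
import pandas as pd

# Hypothetical data: y = 2 * x, z falls linearly as x rises.
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [4, 3, 2, 1]})

corr = df.corr()                          # Pearson by default
spearman = df.corr(method='spearman')     # rank-based alternative
cov_xy = df['x'].cov(df['y'])             # covariance between two Series

print(corr)
print(cov_xy)
```

On a DataFrame both methods return a square matrix indexed by column name; on a Series they take the other Series as an argument and return a single number.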
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])
df['one'] = [1, 1, 1, 2, 2]
#    one  two
# a    1    1
# b    1    3
# c    1    5
# d    2    7
# e    2    9

df.groupby('one')['two'].sum()            # group by a column
# one
# 1     9
# 2    16
# Name: two, dtype: int64

df.groupby([3, 3, 4, 4, 4]).sum()         # group by an external list of keys
#    one  two
# 3    2    4
# 4    5   21

df.groupby('one')['two'].describe()
#      count  mean       std  min  25%  50%  75%  max
# one
# 1      3.0   3.0  2.000000  1.0  2.0  3.0  4.0  5.0
# 2      2.0   8.0  1.414214  7.0  7.5  8.0  8.5  9.0

# Several aggregations per group ('first' and 'last' are groupby aggregations).
df.groupby('one')[['two']].agg(['sum', 'count', 'mean', 'median', 'var',
                                'std', 'min', 'max', 'first', 'last'])
#     two
#     sum count mean median  var       std min max first last
# one
# 1     9     3  3.0    3.0  4.0  2.000000   1   5     1    5
# 2    16     2  8.0    8.0  2.0  1.414214   7   9     7    9

# Aggregations over the whole table; 'first' and 'last' are left out
# because they are not available in DataFrame.agg.
df.agg(['sum', 'count', 'mean', 'median', 'var', 'std', 'min', 'max'])
#              one        two
# sum     7.000000  25.000000
# count   5.000000   5.000000
# mean    1.400000   5.000000
# median  1.000000   5.000000
# var     0.300000  10.000000
# std     0.547723   3.162278
# min     1.000000   1.000000
# max     2.000000   9.000000

# transform returns a result aligned with the original rows.
df['three'] = df.groupby('one')['two'].transform(lambda x: (x.sum() - x) / x.count())
#    one  two     three
# a    1    1  2.666667
# b    1    3  2.000000
# c    1    5  1.333333
# d    2    7  4.500000
# e    2    9  3.500000
```