Python Pandas Library Basics

Posted on 2017-10-04, edited on 2021-10-18, in Python.

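Pandas is a third-party Python library that provides high-performance, easy-to-use data types and analysis tools. It is built on NumPy and is often used together with NumPy and Matplotlib.
Reference: [MOOC] Python Data Analysis and Presentation - Beijing Institute of Technology - [Week 3] Overview of Data Analysis; open course; Document; GitHub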

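The Pandas Library

Pandas is usually imported with import pandas as pd.
pd is the conventional alias for the module.
Differences from NumPy:

NumPy	Pandas
Basic data types	Extended data types
Focuses on the structural representation of data	Focuses on the applied representation of data
Dimensions: relationships among data	Relationships between data and index

Built on NumPy, the library provides two new data types: Series and DataFrame.
These types support several kinds of operations: basic operations, arithmetic operations, feature (statistical) operations, and relational operations.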

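Series Type (One-Dimensional)

A Series consists of a set of data and the index associated with it:
index_0 ------> data_a
index_1 ------> data_b
index_2 ------> data_c
index_3 ------> data_d
index_4 ------> data_e
 index          data

A Series is a one-dimensional labeled array. It has two parts, index and values, which correspond one to one.

Creating a Series

A Series can be created from the following:
- A Python list: the index must have as many entries as the list
- A scalar value: the index determines the size of the Series
- A Python dict: the keys become the index; an explicit index selects entries from the dict
- An ndarray: both the index and the data can be ndarrays
- Other iterables, such as range()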

List

import pandas as pd

s = pd.Series([1,2,3,4,5])
>>>0    1
>>>1    2
>>>2    3
>>>3    4
>>>4    5
>>>dtype: int64
# The first column is the automatic index; the data type follows dtype.

Pandas also accepts a custom index.

import pandas as pd

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
>>>a    1
>>>b    2
>>>c    3
>>>d    4
>>>e    5
>>>dtype: int64
# The first column is the given index; the data type follows dtype.

Scalar

import pandas as pd

s = pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])
>>>a    5
>>>b    5
>>>c    5
>>>d    5
>>>e    5
>>>dtype: int64
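# The first column is the given index; the data type follows dtype.

Note: the index determines the length here. If it is omitted, the result is a Series with a single element.

Dict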

import pandas as pd

s = pd.Series({'a':1, 'b':2, 'c':3, 'd':4, 'e':5}, index=['e', 'd', 'c', 'b', 'a'])
>>>e    5
>>>d    4
>>>c    3
>>>b    2
>>>a    1
>>>dtype: int64
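# The first column is the given index; the data type follows dtype.

Note: index may contain fewer labels than the dict has keys, and the output follows the order of index. A label that is not a key of the dict does not raise an error; its value is NaN.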
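
The selection behaviour described above can be sketched as follows. This is a minimal illustration, assuming a small hypothetical dict; the label 'z' is deliberately not one of its keys.

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3}  # hypothetical example data

# Fewer labels than keys: only the requested entries are kept, in index order.
subset = pd.Series(data, index=['c', 'a'])

# 'z' is not a key of the dict, so its value is NaN and the dtype becomes float64.
with_missing = pd.Series(data, index=['a', 'z'])

print(subset)
print(with_missing)
```

Because NaN is a floating-point value, the integer data is upcast to float64 as soon as one label is missing.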

ndarray

import pandas as pd
import numpy as np

s = pd.Series(np.arange(5), index=np.arange(9,4,-1))
>>>9    0
>>>8    1
>>>7    2
>>>6    3
>>>5    4
>>>dtype: int64
# The first column is the given index; the data type follows dtype.

Note: the data type here is int64.

range()

import pandas as pd

s = pd.Series(range(5), index=range(9,4,-1))
>>>9    0
>>>8    1
>>>7    2
>>>6    3
>>>5    4
>>>dtype: int64
# The first column is the given index; the data type follows dtype.

Note: the data type here is int64.

Basic Operations on Series

Getting All Index Labels and All Data

import pandas as pd

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
s.index
>>>Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

s.values
>>>[1 2 3 4 5]

type(s.values)
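>>><class 'numpy.ndarray'>

Note that s.values returns an ndarray.

Indexing

The automatic (positional) index and the custom (label) index coexist, but they cannot be mixed in a single lookup.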

import pandas as pd

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
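s[0] # positional access through [] on a label-indexed Series is deprecated in recent pandas; prefer s.iloc[0]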
>>>1

s['a']
>>>1

s.at['a']
>>>1

s.iat[0]
>>>1

s[[1,2,3]] # positional list; in recent pandas prefer s.iloc[[1,2,3]]
>>>b    2
>>>c    3
>>>d    4
>>>dtype: int64

s[['a','b','c']]
>>>a    1
>>>b    2
>>>c    3
>>>dtype: int64

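s[['a',1]] # Automatic and custom indexes cannot be mixed. Older pandas returned NaN for the unmatched key (shown below); recent versions raise KeyError.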
>>>a    1.0
>>>1    NaN
>>>dtype: float64
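
To keep the two kinds of index apart, lookups can be written explicitly with .loc (labels) and .iloc (positions). The following is a minimal sketch using the same Series as above; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['a']            # label-based lookup
by_position = s.iloc[0]          # position-based lookup
labels = s.loc[['a', 'b', 'c']]  # list of labels
positions = s.iloc[[1, 2, 3]]    # list of positions

print(by_label, by_position)
print(labels)
print(positions)
```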

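Slicing

NumPy operations and functions can be applied to a Series (the result is still a Series).
A Series can be sliced by custom index labels (the result is still a Series).
A Series can also be sliced by the automatic index; any custom index is sliced along with it (the result is still a Series).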

import pandas as pd
import numpy as np

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
s[3] # positional access; in recent pandas prefer s.iloc[3]
>>>4

s.loc['b': 'd']
>>>b    2
>>>c    3
>>>d    4
>>>dtype: int64

s.iloc[1:4]
>>>b    2
>>>c    3
>>>d    4
>>>dtype: int64

s[:3]
>>>a    1
>>>b    2
>>>c    3
>>>dtype: int64

s[s > s.median()]
>>>d    4
>>>e    5
>>>dtype: int64

np.exp(s)
>>>a      2.718282
>>>b      7.389056
>>>c     20.085537
>>>d     54.598150
>>>e    148.413159
>>>dtype: float64

Dict-like Behavior

Operations used on Python dicts also work on a Series:
Access by custom index label
The in keyword (checks only the custom index, not the automatic index)
The .get() method

import pandas as pd

s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
'c' in s
>>>True

3 in s
>>>False

s.get('c')
>>>3

s.get('f')
>>>None

s.get('f', 100)
>>>100

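Notes on s.get(key):
- If key exists in the index, the value for that label is returned.
- If key does not exist in the index, None is returned.
- If get is given a second argument and key does not exist in the index, the second argument is returned; s itself is not changed.

Alignment

In arithmetic, Series objects are automatically aligned on their index labels.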

import pandas as pd

s1 = pd.Series([1,2,3], index=['a', 'b', 'c'])
s2 = pd.Series([3,4,5], index=['c', 'd', 'e'])
s1 + s2
>>>a    NaN
>>>b    NaN
>>>c    6.0
>>>d    NaN
>>>e    NaN
>>>dtype: float64
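
When the NaN results of alignment are not wanted, the method form of the operator can supply a default for labels that appear in only one operand. This is a minimal sketch with the same two Series; the choice of fill_value=0 is an assumption made for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([3, 4, 5], index=['c', 'd', 'e'])

# Labels present in only one operand are treated as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)
```

The result still covers the union of both indexes; only the missing operands are replaced before the addition.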

Names

Both a Series object and its index can have a name, stored in the .name attribute.

import pandas as pd
s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
s.name = 'values'
s.index.name = 'indexes'
>>>indexes
>>>a    1
>>>b    2
>>>c    3
>>>d    4
>>>e    5
>>>Name: values, dtype: int64

Modification

A Series object can be modified at any time, and the change takes effect immediately.

import pandas as pd
s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
s['c'] = 30
>>>a    1
>>>b    2
>>>c   30
>>>d    4
>>>e    5
>>>dtype: int64

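DataFrame Type (Two-Dimensional)

A DataFrame consists of a group of columns that share the same index:
            columns             axis=1
            index_0 ------> data_a data_f ... data_v
            index_1 ------> data_b data_g ... data_w
rows        index_2 ------> data_c data_h ... data_x
axis=0      index_3 ------> data_d data_i ... data_y
            index_4 ------> data_e data_j ... data_z
             index              data
A DataFrame is a tabular data type whose columns may hold different value types (similar to Excel).
A DataFrame has both a row index and a column index.
A DataFrame is usually used for two-dimensional data, but it can also represent higher-dimensional data.

Creating a DataFrame

A DataFrame can be created from the following:
- A two-dimensional ndarray
- A dict of one-dimensional ndarrays, lists, dicts, tuples, or Series
- Series objects
- Another DataFrame

Two-Dimensional ndarray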

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5,2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

Dict

import pandas as pd
import numpy as np

df_dict = {'one': np.arange(0,10,2), 'two': np.arange(1,10,2)}
df = pd.DataFrame(df_dict, index=['a', 'b', 'c', 'd', 'e'])

df_dict = {'one': [0,2,4,6,8], 'two': [1,3,5,7,9]}
df = pd.DataFrame(df_dict, index=['a', 'b', 'c', 'd', 'e'])

df_dict = {'one': {'a':0, 'b':2, 'c':4, 'd':6, 'e':8}, 'two': {'a':1, 'b':3, 'c':5, 'd':7, 'e':9}}
df = pd.DataFrame(df_dict)

df_dict = {'one': (0,2,4,6,8), 'two': (1,3,5,7,9)}
df = pd.DataFrame(df_dict, index=['a', 'b', 'c', 'd', 'e'])

df_dict = {'one': pd.Series(range(0,10,2), index=['a', 'b', 'c', 'd', 'e']), 'two': pd.Series(range(1,10,2), index=['a', 'b', 'c', 'd', 'e'])}
df = pd.DataFrame(df_dict)

# All of the constructions above produce the same result
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

The dict values can be ndarrays, lists, dicts, tuples, or Series.

Series

import pandas as pd

s1 = pd.Series([0,1], index=['one', 'two'])
s2 = pd.Series([2,3], index=['one', 'two'])
s3 = pd.Series([4,5], index=['one', 'two'])
s4 = pd.Series([6,7], index=['one', 'two'])
s5 = pd.Series([8,9], index=['one', 'two'])
df = pd.DataFrame([s1,s2,s3,s4,s5], index=['a','b','c','d','e'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

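Note: the Series objects here should share the same index. If their indexes differ, no error is raised; the columns become the union of the indexes and the missing entries are filled with NaN.

Another DataFrame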
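
What happens when the row Series do not share an index can be checked with a small sketch. The second Series below is a hypothetical row whose index differs from the first; the resulting column order is not asserted here.

```python
import pandas as pd

r1 = pd.Series([0, 1], index=['one', 'two'])
r2 = pd.Series([2, 3], index=['two', 'three'])  # hypothetical row with a different index

# Each Series becomes one row; columns are the union of both indexes.
df = pd.DataFrame([r1, r2], index=['a', 'b'])
print(df)
```

Entries that a row does not define (column three for row a, column one for row b) come out as NaN.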

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5,2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

df = pd.DataFrame(df, index=['a','b','c'], columns=['one']) # build from the existing DataFrame df, keeping only the listed rows and columns
>>>   one
>>>a    0
>>>b    2
>>>c    4

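This is similar to slicing the DataFrame.

Reading a CSV File

from io import StringIO
import pandas as pd

df = pd.read_csv('aaa.csv') # read the file aaa.csv
df = pd.read_csv('aaa.csv', index_col='bbb') # use column bbb as the row labels
df = pd.read_csv(StringIO('one,two\n0,1\n2,3\n4,5\n6,7\n8,9')) # read a CSV-formatted string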
>>>   one  two
>>>0    0    1
>>>1    2    3
>>>2    4    5
>>>3    6    7
>>>4    8    9
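
The index_col option can be tried without a file on disk by combining it with StringIO. This is a minimal sketch; the CSV text and the column name 'name' are made up for illustration.

```python
from io import StringIO

import pandas as pd

csv_text = 'name,one,two\na,0,1\nb,2,3\n'  # hypothetical CSV content

# The 'name' column becomes the row index instead of a data column.
df = pd.read_csv(StringIO(csv_text), index_col='name')
print(df)
```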

Basic Operations on DataFrame

Getting All Index Labels and All Data

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
df.index
>>>Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

df.columns
>>>Index(['one', 'two'], dtype='object')

df.values
>>>[[0 1]
>>> [2 3]
>>> [4 5]
>>> [6 7]
>>> [8 9]]

Indexing

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
df[['one']] # returns a DataFrame
>>>   one
>>>a    0
>>>b    2
>>>c    4
>>>d    6
>>>e    8

df['one'] # returns a Series; the automatic (positional) index cannot be used here
>>>a    0
>>>b    2
>>>c    4
>>>d    6
>>>e    8
>>>Name: one, dtype: int32

df.loc['c'] # returns a Series; the automatic (positional) index cannot be used here
>>>one    4
>>>two    5
>>>Name: c, dtype: int32
        
df.iloc[1] # returns the second row as a Series; df[1] cannot be used directly
>>>one    2
>>>two    3
>>>Name: b, dtype: int32

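df['one']['c'] # returns a NumPy integer scalar (numpy.int32 here); note the order is column first, then row
>>>4

df.loc['c', 'one'] # returns a NumPy integer scalar; note the order is row first, then column
>>>4

df.at['c', 'one'] # returns a NumPy integer scalar; note the order is row first, then column
>>>4

df.iloc[2, 0] # returns a NumPy integer scalar; note the order is row first, then column
>>>4

df.iat[2, 0] # returns a NumPy integer scalar; note the order is row first, then column
>>>4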

Slicing

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
df[0:3] # returns the first three rows
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5

df.head(3) # returns the first three rows
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5

df.head(-3) # returns all but the last three rows (here, the first two)
>>>   one  two
>>>a    0    1
>>>b    2    3

df.take([0, 1, 2]) # returns the first, second, and third rows
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5

df.take([0], axis=1) # returns the first column
>>>   one
>>>a    0
>>>b    2
>>>c    4
>>>d    6
>>>e    8

df.take([-1, -2]) # returns the last and second-to-last rows, in that order
>>>   one  two
>>>e    8    9
>>>d    6    7

df.tail(2) # returns the last two rows; note the order differs from the take example above
>>>   one  two
>>>d    6    7
>>>e    8    9

df.tail(-2) # returns all but the first two rows (here, the last three)
>>>   one  two
>>>c    4    5
>>>d    6    7
>>>e    8    9

df.loc['b':'d'] # unlike positional slicing, a label slice includes both the start and the end
>>>   one  two
>>>b    2    3
>>>c    4    5
>>>d    6    7

df.loc['c':, ['one']] # rows from label c to the end, selecting only column one
>>>   one
>>>c    4
>>>d    6
>>>e    8

df[df > df.median()]
>>>   one  two
>>>a  NaN  NaN
>>>b  NaN  NaN
>>>c  NaN  NaN
>>>d  6.0  7.0
>>>e  8.0  9.0

np.exp(df)
>>>           one          two
>>>a     1.000000     2.718282
>>>b     7.389056    20.085537
>>>c    54.598150   148.413159
>>>d   403.428793  1096.633158
>>>e  2980.957987  8103.083928

Conditional Filtering

Simple logical conditions (<, >, ==, &, |, ~, etc.)

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
df.loc[df['one'] > 5] # records whose one value is greater than 5
df[df['one'] > 5]
>>>   one  two
>>>d    6    7
>>>e    8    9

df['two'][df['one'] > 5] # returns a Series
>>>d    7
>>>e    9
>>>Name: two, dtype: int32

df.loc[:, ['two']][df['one'] > 5] # returns a DataFrame; the leading ':' is required so that 'two' is read as a column label, not a row label
>>>   two
>>>d    7
>>>e    9

df.loc[:, ['two','one']][df['one'] > 5] # swap the column order
df[['two','one']][df['one'] > 5]
>>>   two  one
>>>d    7    6
>>>e    9    8

df.loc[(df['one'] > 5) | (df['two'] <5)] # records with one greater than 5 or two less than 5; each condition must be wrapped in parentheses
df[(df['one'] > 5) | (df['two'] <5)]
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>d    6    7
>>>e    8    9

df.loc[df['two'] != 5] # drop the rows that match a condition (here, two == 5)
df[df['two'] != 5]
df.drop(df.loc[df.two == 5].index, axis=0)
df.drop(df[df.two == 5].index, axis=0)
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>d    6    7
>>>e    8    9
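
The examples above use only the | operator. As a complement, here is a minimal sketch of & (and) and ~ (not) on the same DataFrame; the thresholds are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])

both = df[(df['one'] > 1) & (df['two'] < 9)]  # rows where both conditions hold
negated = df[~(df['one'] > 5)]                # rows where the condition does not hold

print(both)
print(negated)
```

As with |, each comparison needs its own parentheses because & and ~ bind more tightly than the comparison operators.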

Filtering with Custom Functions

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
df.loc[lambda x: x['one'] * x['two'] > 5] # the argument x is the whole DataFrame
>>>   one  two
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

df[df.apply(lambda x: x['one'] * x['two'] > 5, axis=1)] # the argument x is one row, as a Series
>>>   one  two
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

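def greater_than_five(x):  # renamed from filter to avoid shadowing the built-in
    try:
        return x > 5
    except TypeError:  # values that cannot be compared are treated as not matching
        return False

df[df.apply(greater_than_five, axis=1)] # pass a named function to apply; the result is an element-wise boolean mask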
>>>   one  two
>>>a  NaN  NaN
>>>b  NaN  NaN
>>>c  NaN  NaN
>>>d  6.0  7.0
>>>e  8.0  9.0

Dict-like Behavior

The in keyword only checks whether a value exists among the columns.
The get method likewise only retrieves columns.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
'c' in df
>>>False

'one' in df
>>>True

df.get('c')
>>>None

df.get('one')
>>>a    0
>>>b    2
>>>c    4
>>>d    6
>>>e    8
>>>Name: one, dtype: int32

Alignment

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a','b','c'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5

df2 = pd.DataFrame(np.arange(4, 10).reshape(3, 2), index=['c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>c    4    5
>>>d    6    7
>>>e    8    9

df1 + df2
>>>   one   two
>>>a  NaN   NaN
>>>b  NaN   NaN
>>>c  8.0  10.0
>>>d  NaN   NaN
>>>e  NaN   NaN

Names

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
df.columns.name = 'columns'
df.index.name = 'indexes'
>>>columns  one  two
>>>indexes          
>>>a          0    1
>>>b          2    3
>>>c          4    5
>>>d          6    7
>>>e          8    9

Modification

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
s = {8,6,4,2,0}
>>>set([0, 8, 2, 4, 6])

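df['one'] = {8,6,4,2,0} # a set is unordered; older pandas assigned its elements in iteration order (shown below), while recent versions may reject a set with a TypeError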
>>>   one  two
>>>a    0    1
>>>b    8    3
>>>c    2    5
>>>d    4    7
>>>e    6    9

df['one'] = [8,6,4,2,0]
>>>   one  two
>>>a    8    1
>>>b    6    3
>>>c    4    5
>>>d    2    7
>>>e    0    9

df['one'] = (8,6,4,2,0)
>>>   one  two
>>>a    8    1
>>>b    6    3
>>>c    4    5
>>>d    2    7
>>>e    0    9

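A set is unordered, so its elements are not assigned in the order in which they were written. Lists and tuples keep their order.

Pandas Data Type Operations

Pandas provides many functions for operating on Series and DataFrame objects. The two types behave similarly, so the examples below use DataFrame.

Reindexing

.reindex() changes or rearranges the index of a Series or DataFrame.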

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

df.reindex(index=['e','d','c','b','a'])
>>>   one  two
>>>e    8    9
>>>d    6    7
>>>c    4    5
>>>b    2    3
>>>a    0    1

df.reindex(columns=['two','one','three']) # only the labels are rearranged; a column name that does not exist yields a column of NaN
>>>   two  one  three
>>>a    1    0    NaN
>>>b    3    2    NaN
>>>c    5    4    NaN
>>>d    7    6    NaN
>>>e    9    8    NaN

df.set_index('one') # turn a column into the index
>>>     two
>>>one     
>>>0      1
>>>2      3
>>>4      5
>>>6      7
>>>8      9

df.reset_index() # generate a new ascending integer index; the old index becomes a column
>>>  index  one  two
>>>0     a    0    1
>>>1     b    2    3
>>>2     c    4    5
>>>3     d    6    7
>>>4     e    8    9

df.reset_index(drop=True) # generate a new ascending integer index; the old index is discarded
>>>   one  two
>>>0    0    1
>>>1    2    3
>>>2    4    5
>>>3    6    7
>>>4    8    9

df.rename(columns={'one': 'x', 'two': 'y'}) # rename column labels
>>>   x  y
>>>a  0  1
>>>b  2  3
>>>c  4  5
>>>d  6  7
>>>e  8  9

df.rename({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}) # rename row labels
>>>   one  two
>>>1    0    1
>>>2    2    3
>>>3    4    5
>>>4    6    7
>>>5    8    9

df.rename(str.upper) # apply a function to the row labels; pass axis='columns' to transform the column labels instead
>>>   one  two
>>>A    0    1
>>>B    2    3
>>>C    4    5
>>>D    6    7
>>>E    8    9

Insertion

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

df2 = df1.columns.insert(1, 'three')
>>>Index(['one', 'three', 'two'], dtype='object')

df3 = df1.reindex(columns=df2, fill_value=5)
>>>   one  three  two
>>>a    0      5    1
>>>b    2      5    3
>>>c    4      5    5
>>>d    6      5    7
>>>e    8      5    9

df4 = df1.index.insert(3, 'f')
>>>Index(['a', 'b', 'c', 'f', 'd', 'e'], dtype='object')

df5 = df1.reindex(index=df4, fill_value=5)
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>f    5    5
>>>d    6    7
>>>e    8    9
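
Besides reindexing with a modified column Index, a column can also be added at a given position with DataFrame.insert. This is a minimal sketch of that alternative; unlike reindex, it modifies the DataFrame in place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])

# Insert a constant column named 'three' at position 1; df itself is changed.
df.insert(1, 'three', 5)
print(df)
```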

Merging

Official merging tutorial

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

df2 = pd.DataFrame(np.arange(10, 16).reshape(3, 2), index=['f','g','h'], columns=['one', 'two'])
>>>   one  two
>>>f   10   11
>>>g   12   13
>>>h   14   15

df3 = pd.DataFrame(np.arange(16, 22).reshape(3, 2), index=['i','j','k'], columns=['one', 'two'])
>>>   one  two
>>>i   16   17
>>>j   18   19
>>>k   20   21

df4 = pd.DataFrame(np.arange(10, 25).reshape(5, 3), index=['a','b','c','d','e'], columns=['three', 'four', 'five'])
>>>   three  four  five
>>>a     10    11    12
>>>b     13    14    15
>>>c     16    17    18
>>>d     19    20    21
>>>e     22    23    24

df5 = pd.concat([df1, df2, df3], keys=['x', 'y', 'z']) # row-wise merge
>>>     one  two
>>>x a    0    1
>>>  b    2    3
>>>  c    4    5
>>>  d    6    7
>>>  e    8    9
>>>y f   10   11
>>>  g   12   13
>>>  h   14   15
>>>z i   16   17
>>>  j   18   19
>>>  k   20   21

df5.loc['y']
>>>   one  two
>>>f   10   11
>>>g   12   13
>>>h   14   15

df5.loc['y'].loc['g']
>>>one    12
>>>two    13
>>>Name: g, dtype: int32
    
pd.concat([df1, df4], axis=1) # column-wise merge
>>>   one  two  three  four  five
>>>a    0    1     10    11    12
>>>b    2    3     13    14    15
>>>c    4    5     16    17    18
>>>d    6    7     19    20    21
>>>e    8    9     22    23    24

df6 = pd.DataFrame(np.arange(10, 16).reshape(3, 2), index=['d','e','f'], columns=['two', 'three'])
>>>   two  three
>>>d   10     11
>>>e   12     13
>>>f   14     15

pd.concat([df1, df6], axis=1) # column-wise merge, union of rows
>>>   one  two   two  three
>>>a  0.0  1.0   NaN    NaN
>>>b  2.0  3.0   NaN    NaN
>>>c  4.0  5.0   NaN    NaN
>>>d  6.0  7.0  10.0   11.0
>>>e  8.0  9.0  12.0   13.0
>>>f  NaN  NaN  14.0   15.0

pd.concat([df1, df6], axis=1, join='inner') # column-wise merge, intersection of rows
>>>   one  two  two  three
>>>d    6    7   10     11
>>>e    8    9   12     13

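pd.concat([df1, df2]) # row-wise merge; replaces df1.append(df2), since DataFrame.append was removed in pandas 2.0
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9
>>>f   10   11
>>>g   12   13
>>>h   14   15

pd.concat([df1, df2, df3]) # row-wise merge of several DataFrames; replaces df1.append([df2, df3])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9
>>>f   10   11
>>>g   12   13
>>>h   14   15
>>>i   16   17
>>>j   18   19
>>>k   20   21

pd.concat([df1, df6]) # row-wise merge, union of columns; replaces df1.append(df6)
>>>   one  two  three
>>>a  0.0    1    NaN
>>>b  2.0    3    NaN
>>>c  4.0    5    NaN
>>>d  6.0    7    NaN
>>>e  8.0    9    NaN
>>>d  NaN   10   11.0
>>>e  NaN   12   13.0
>>>f  NaN   14   15.0

pd.concat([df1, df6], ignore_index=True) # row-wise merge, union of columns, with a fresh index; replaces df1.append(df6, ignore_index=True)
>>>   one  two  three
>>>0  0.0    1    NaN
>>>1  2.0    3    NaN
>>>2  4.0    5    NaN
>>>3  6.0    7    NaN
>>>4  8.0    9    NaN
>>>5  NaN   10   11.0
>>>6  NaN   12   13.0
>>>7  NaN   14   15.0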

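Index

The index of a Series or DataFrame is an Index object, and Index objects are immutable.
Common methods of the Index type:

Method	Description
`idx.append(idx)`	Concatenates another Index object and returns a new Index object
`idx.difference(idx)`	Computes the set difference and returns a new Index object
`idx.intersection(idx)`	Computes the intersection and returns a new Index object
`idx.union(idx)`	Computes the union and returns a new Index object
`idx.delete(loc)`	Deletes the element at position loc and returns a new Index object
`idx.insert(loc, e)`	Inserts element e at position loc and returns a new Index object
`Series/DataFrame.drop(labels, axis=0/1)`	Drops the given row or column labels from a Series or DataFrame; axis=1 means columns (default axis=0)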
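
The methods in the table can be tried on two small Index objects. This is a minimal sketch with made-up labels; it also checks that item assignment on an Index fails, which is what immutability means in practice.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

common = left.intersection(right)   # labels in both
combined = left.union(right)        # labels in either
only_left = left.difference(right)  # labels in left but not in right
inserted = left.insert(1, 'x')      # new Index with 'x' at position 1
removed = left.delete(0)            # new Index without the first label

# Index objects are immutable: item assignment raises TypeError.
try:
    left[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

print(common, combined, only_left, inserted, removed, mutated)
```

Every method returns a new Index and leaves the original unchanged.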

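Data Type Arithmetic

Rules for Arithmetic Operations

Arithmetic operations first align the operands on their row and column labels; when alignment introduces missing entries, the result is floating point.
Entries that are missing after alignment are filled with NaN (null).
Operations between two-dimensional and one-dimensional objects, or between one-dimensional objects and scalars, are broadcast (the lower-dimensional operand is applied to every element of the higher-dimensional one).
Binary operations written with the +, -, *, / operators produce new objects.
The method forms of these operations accept parameters that avoid the NaN values described above.

Method	Description
`.add(d, **kwargs)`	Addition between objects, with optional parameters
`.sub(d, **kwargs)`	Subtraction between objects, with optional parameters
`.mul(d, **kwargs)`	Multiplication between objects, with optional parameters
`.div(d, **kwargs)`	Division between objects, with optional parameters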

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(3).reshape(3,1), index=['a','b','c'], columns=['one'])
>>>   one
>>>a    0
>>>b    1
>>>c    2

df2 = pd.DataFrame(np.arange(10).reshape(5,2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

df1 + df2
>>>   one  two
>>>a  0.0  NaN
>>>b  3.0  NaN
>>>c  6.0  NaN
>>>d  NaN  NaN
>>>e  NaN  NaN

df1.add(df2, fill_value=5) # fill the missing entries first, then compute
>>>    one   two
>>>a   0.0   6.0
>>>b   3.0   8.0
>>>c   6.0  10.0
>>>d  11.0  12.0
>>>e  13.0  14.0

df3 = pd.Series([5, 10], index=['one', 'two'])
>>>one     5
>>>two    10
>>>dtype: int32

df3 + 5 # broadcasting
>>>one    10
>>>two    15
>>>dtype: int32

df2 + df3 # broadcasting across dimensions; a one-dimensional Series is matched along axis 1 (the columns) by default
>>>   one  two
>>>a    5   11
>>>b    7   13
>>>c    9   15
>>>d   11   17
>>>e   13   19

df3 = pd.Series([5, 10, 15], index=['a', 'b', 'c'])
>>>a     5
>>>b    10
>>>c    15
>>>dtype: int32

df2.add(df3, axis=0) # the method form lets a one-dimensional Series be matched along axis 0 (the rows)
>>>    one   two
>>>a   5.0   6.0
>>>b  12.0  13.0
>>>c  19.0  20.0
>>>d   NaN   NaN
>>>e   NaN   NaN

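Rules for Comparison Operations

Comparison operations only compare elements with identical index labels; no alignment or filling is performed.
Operations between two-dimensional and one-dimensional objects, or between one-dimensional objects and scalars, are broadcast.
Binary operations written with >, <, >=, <=, ==, != produce boolean objects.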

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(15,5,-1).reshape(5,2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a   15   14
>>>b   13   12
>>>c   11   10
>>>d    9    8
>>>e    7    6

df2 = pd.DataFrame(np.arange(10).reshape(5,2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

df1 > df2 # df1 and df2 must have the same shape and identical labels
>>>     one    two
>>>a   True   True
>>>b   True   True
>>>c   True   True
>>>d   True   True
>>>e  False  False

df3 = pd.Series([5, 10], index=['one', 'two']) # labels match the columns of df2, as the output below shows
>>>one     5
>>>two    10
>>>dtype: int32

df3 > 5
>>>one    False
>>>two     True
>>>dtype: bool

df3 > df2 # broadcasting across dimensions; a one-dimensional Series is matched along axis 1 (the columns) by default
>>>     one   two
>>>a   True  True
>>>b   True  True
>>>c   True  True
>>>d  False  True
>>>e  False  True
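
The rule that comparisons do not align can be seen by comparing two DataFrames whose row labels differ. This is a minimal sketch with small made-up frames; aligning explicitly with DataFrame.align is one possible way to compare only the shared labels, not the only one.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4).reshape(2, 2), index=['a', 'b'], columns=['one', 'two'])
df2 = pd.DataFrame(np.arange(4).reshape(2, 2), index=['b', 'c'], columns=['one', 'two'])

# Differently labeled DataFrames cannot be compared directly.
try:
    df1 > df2
    raised = False
except ValueError:
    raised = True

# Aligning first restricts both operands to the shared labels (row 'b' here).
left, right = df1.align(df2, join='inner')
result = left >= right

print(raised)
print(result)
```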

Single-Column / Multi-Column / Group / Aggregate Operations

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5,2), index=['a','b','c','d','e'], columns=['one', 'two'])
>>>   one  two
>>>a    0    1
>>>b    2    3
>>>c    4    5
>>>d    6    7
>>>e    8    9

df.loc[:, 'one'] = df.loc[:, 'one'].map(lambda x: x ** 2) # a column of a DataFrame is a Series, so map can transform one column; each example in this section starts from the original df above
>>>   one  two
>>>a    0    1
>>>b    4    3
>>>c   16    5
>>>d   36    7
>>>e   64    9

def square(x):
    return x ** 2
df.loc[:, 'one'] = df.loc[:, 'one'].map(square) # x in the lambda above is the current element; a named function can replace the lambda
>>>   one  two
>>>a    0    1
>>>b    4    3
>>>c   16    5
>>>d   36    7
>>>e   64    9

df.loc[:, 'one'] = df.loc[:, 'one'].map(lambda x: True if x >= 5 else False) # any expression can be used inside the lambda
df.loc[:, 'two'] = df.loc[:, 'two'].map(lambda x: True if x >= 5 else False)
>>>     one    two
>>>a  False  False
>>>b  False  False
>>>c  False   True
>>>d   True   True
>>>e   True   True

df.loc[:, 'three'] = df.apply(lambda x: x['one'] + 2 * x['two'], axis=1) # use apply to compute across several columns at once
>>>   one  two  three
>>>a    0    1      2
>>>b    2    3      8
>>>c    4    5     14
>>>d    6    7     20
>>>e    8    9     26

df.loc['f', :] = df.apply(lambda x: x['a'] + 2 * x['b'], axis=0) # to compute across several rows at once, set axis to 0
>>>   one  two
>>>a  0.0  1.0
>>>b  2.0  3.0
>>>c  4.0  5.0
>>>d  6.0  7.0
>>>e  8.0  9.0
>>>f  4.0  7.0

# Use applymap to transform every element of a DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map)
df = df.applymap(lambda x: x ** 2 if x <= 5 else x * 2)
>>>   one  two
>>>a    0    1
>>>b    4    9
>>>c   16   25
>>>d   12   14
>>>e   16   18

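Pandas Data Feature Analysis

Sorting

.sort_index(axis=0, ascending=True) sorts along the given axis by index labels, ascending by default.
Series.sort_values(axis=0, ascending=True) and DataFrame.sort_values(by, axis=0, ascending=True) sort along the given axis by values, ascending by default.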

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5,2), index=['c','b','a','e','d'], columns=['two', 'one'])
>>>   two  one
>>>c    0    1
>>>b    2    3
>>>a    4    5
>>>e    6    7
>>>d    8    9

df.sort_index() # sort by row labels, ascending (alphabetical order)
>>>   two  one
>>>a    4    5
>>>b    2    3
>>>c    0    1
>>>d    8    9
>>>e    6    7

df.sort_index(ascending=False) # sort by row labels, descending (alphabetical order)
>>>   two  one
>>>e    6    7
>>>d    8    9
>>>c    0    1
>>>b    2    3
>>>a    4    5

df.sort_index(axis=1) # sort by column labels, ascending (alphabetical order)
>>>   one  two
>>>c    1    0
>>>b    3    2
>>>a    5    4
>>>e    7    6
>>>d    9    8

df.sort_values('two', ascending=False) # reorder all rows by the values in column 'two'
>>>   two  one
>>>d    8    9
>>>e    6    7
>>>a    4    5
>>>b    2    3
>>>c    0    1

df.sort_values('c', axis=1, ascending=False) # reorder the columns by the values in row 'c'
>>>   one  two
>>>c    1    0
>>>b    3    2
>>>a    5    4
>>>e    7    6
>>>d    9    8

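Note: by default, NaN values are placed at the end of the sorted result, for both ascending and descending sorts.

Statistics

Basic statistics, for both Series and DataFrame

Method	Description
.sum()	Sum of the data, computed along axis 0 (same below)
.count()	Number of non-NaN values
.mean() .median()	Arithmetic mean and median of the data
.var() .std()	Variance and standard deviation of the data
.min() .max()	Minimum and maximum of the data
.describe()	Summary statistics along axis 0 (per column)

Basic statistics, for Series

Method	Description
.argmin() .argmax()	Position (automatic index) of the minimum and maximum values
.idxmin() .idxmax()	Label (custom index) of the minimum and maximum values

Cumulative computations, for both Series and DataFrame

Method	Description
.cumsum()	Sum of the first 1, 2, ..., n values
.cumprod()	Product of the first 1, 2, ..., n values
.cummax()	Maximum of the first 1, 2, ..., n values
.cummin()	Minimum of the first 1, 2, ..., n values

Rolling (window) computations, for both Series and DataFrame

Method	Description
.rolling(w).sum()	Sum of each run of w adjacent elements
.rolling(w).mean()	Arithmetic mean of each run of w adjacent elements
.rolling(w).var()	Variance of each run of w adjacent elements
.rolling(w).std()	Standard deviation of each run of w adjacent elements
.rolling(w).min() .max()	Minimum and maximum of each run of w adjacent elements

Correlation analysis, for both Series and DataFrame

Method	Description
.cov()	Covariance matrix
.corr()	Correlation matrix, using the Pearson, Spearman, or Kendall coefficient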
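
Where NaN values end up can be controlled with the na_position parameter of sort_values. This is a minimal sketch on a small made-up Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0])

asc = s.sort_values()                           # ascending: NaN placed last
desc = s.sort_values(ascending=False)           # descending: NaN still placed last
nan_first = s.sort_values(na_position='first')  # NaN placed first on request

print(asc)
print(desc)
print(nan_first)
```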
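
A few of the methods from the tables above, applied to the small DataFrame used throughout this article. This is a minimal sketch; the window size of 2 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), index=['a', 'b', 'c', 'd', 'e'], columns=['one', 'two'])

running_total = df.cumsum()       # cumulative sum down each column
window_sum = df.rolling(2).sum()  # sum of each pair of adjacent rows; the first row is NaN
correlation = df.corr()           # Pearson correlation by default

print(running_total)
print(window_sum)
print(correlation)
```

The first row of the rolling result is NaN because a window of 2 needs two rows before it can produce a value.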

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5,2), index=['a','b','c','d','e'], columns=['one', 'two'])
df['one'] = [1, 1, 1, 2, 2]
>>>   one  two
>>>a    1    1
>>>b    1    3
>>>c    1    5
>>>d    2    7
>>>e    2    9

df.groupby('one')['two'].sum() # group by column one and sum column two; returns a Series
>>>one
>>>1     9
>>>2    16
>>>Name: two, dtype: int32
    
df.groupby([3, 3, 4, 4, 4]).sum() # the groups can also be given explicitly
>>>   one  two
>>>3    2    4
>>>4    5   21

df.groupby('one')['two'].describe() # group by column one and summarize column two; returns a DataFrame
>>>     count  mean       std  min  25%  50%  75%  max
>>>one                                                
>>>1      3.0   3.0  2.000000  1.0  2.0  3.0  4.0  5.0
>>>2      2.0   8.0  1.414214  7.0  7.5  8.0  8.5  9.0

df['three'] = df.groupby('one')['two'].transform(lambda x: (x.sum() - x) / x.count()) # group by column one and apply a function to column two
>>>   one  two     three
>>>a    1    1  2.666667
>>>b    1    3  2.000000
>>>c    1    5  1.333333
>>>d    2    7  4.500000
>>>e    2    9  3.500000

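df.groupby('one').agg(['sum', 'count', 'mean', 'median', 'var', 'std', 'min', 'max', 'first', 'last']) # agg applies each function to a column and returns one scalar per function; built-in function names are given as strings (output shown for df before the three column was added)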
>>>    two                                                   
>>>    sum count mean median var       std min max first last
>>>one                                                       
>>>1     9     3    3      3   4  2.000000   1   5     1    5
>>>2    16     2    8      8   2  1.414214   7   9     7    9

df.agg(['sum', 'count', 'mean', 'median', 'var', 'std', 'min', 'max']) # 'first' and 'last' are groupby aggregations, so they are omitted here, matching the output below
>>>             one        two
>>>sum     7.000000  25.000000
>>>count   5.000000   5.000000
>>>mean    1.400000   5.000000
>>>median  1.000000   5.000000
>>>var     0.300000  10.000000
>>>std     0.547723   3.162278
>>>min     1.000000   1.000000
>>>max     2.000000   9.000000
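
When only a few statistics are needed, the output columns of a grouped aggregation can be named directly. This is a minimal sketch using pandas named aggregation; the output names total and average are hypothetical choices.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 1, 1, 2, 2], 'two': [1, 3, 5, 7, 9]}, index=['a', 'b', 'c', 'd', 'e'])

# Each keyword names an output column; the tuple gives (source column, function).
summary = df.groupby('one').agg(total=('two', 'sum'), average=('two', 'mean'))
print(summary)
```

This avoids the two-level column header produced when a list of function names is passed to agg.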