Basic functionality
Reindexing
Reindexing a Series
>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
>>> obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
Calling reindex rearranges the data according to the new index; if an index value is not currently present, a missing value is introduced.
>>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
>>> obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
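As a hedged illustration of the behavior above, the sketch below shows two ways to control what reindex puts in newly created slots: the `fill_value` argument and `method='ffill'`. The names `filled`, `colors`, and `ffilled` are made up for this example.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Labels missing from the original index receive fill_value instead of NaN.
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)
print(filled)

# For ordered data, method='ffill' carries the last valid value forward.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
ffilled = colors.reindex(range(6), method='ffill')
print(ffilled)
```

Forward filling requires a monotonic index, which is why the second Series uses sorted integer labels.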
Reindexing a DataFrame
>>> frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
...                      index=['a', 'c', 'd'],
...                      columns=['Ohio', 'Texas', 'California'])
>>> frame
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
>>> states = ['Texas', 'Utah', 'California']
>>> frame.reindex(columns=states)
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
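Rows and columns can also be reindexed in a single call. The following is a minimal sketch, assuming the same `frame` as above; the name `both` is hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Reindex rows and columns together; new labels are filled with NaN.
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(both)
```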
Dropping entries by index
Series drop
>>> obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
>>> obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
>>> new_obj = obj.drop('c')
>>> new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
>>> obj.drop(['d', 'c'])
a 0.0
b 1.0
e 4.0
dtype: float64
The drop method returns a new object with the specified entries removed.
DataFrame drop
>>> data = pd.DataFrame(np.arange(16).reshape((4, 4)),
...                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
...                     columns=['one', 'two', 'three', 'four'])
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
>>> data.drop(['Colorado', 'Ohio'])
one two three four
Utah 8 9 10 11
New York 12 13 14 15
Pass axis=1 or axis='columns' to drop values from the columns:
>>> data.drop('two', axis=1)
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
>>> data.drop(['two', 'four'], axis='columns')
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
With inplace=True, the deletion is performed directly on the original data:
>>> obj.drop('c', inplace=True)
>>> obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
Use inplace with care: it destroys all of the data that is dropped.
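One way to avoid losing data by accident is to skip inplace and rebind the name to the result of drop. The sketch below illustrates this, and also shows the `errors='ignore'` option for labels that may not exist; the names `trimmed`, `backup`, and `same` are made up for the example.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# drop returns a new object; the original is left untouched.
trimmed = obj.drop('c')
print(len(obj), len(trimmed))

# Keep a copy if the dropped values might be needed later, then rebind.
backup = obj.copy()
obj = obj.drop('c')

# errors='ignore' silently skips labels that are not present.
same = obj.drop('zzz', errors='ignore')
print(same)
```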
Indexing, selection, and filtering
Indexing
>>> obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
>>> obj
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
>>> obj['b']
1.0
>>> obj[1]  # positional lookup; pandas 2.x deprecates this form in favor of obj.iloc[1]
1.0
>>> obj[2:4]
c 2.0
d 3.0
dtype: float64
>>> obj[['b', 'a', 'd']]
b 1.0
a 0.0
d 3.0
dtype: float64
>>> obj[[1, 3]]  # positional lookup; obj.iloc[[1, 3]] is the explicit form
b 1.0
d 3.0
dtype: float64
>>> obj[obj < 2]
a 0.0
b 1.0
dtype: float64
Slicing with index labels
>>> obj['b':'c']
b 1.0
c 2.0
dtype: float64
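Slicing with index labels behaves differently from ordinary Python slicing: the endpoint is inclusive.
Assigning through a label slice modifies the corresponding section of the Series: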
>>> obj['b':'c'] = 5
>>> obj
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
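To make the contrast between label slices and position slices concrete, here is a small sketch using the explicit `loc` and `iloc` accessors. The names `by_label` and `by_position` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints are included.
by_label = obj.loc['b':'c']

# Position-based slice: the stop position is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```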
Indexing a DataFrame
>>> data = pd.DataFrame(np.arange(16).reshape((4, 4)),
...                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
...                     columns=['one', 'two', 'three', 'four'])
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
>>> data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
>>> data[['three', 'one']]
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
In a DataFrame, a slice such as data[:2] selects rows, not columns. Passing a single label or a list of labels to [] selects columns.
>>> data[:2]
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
>>> data[data['three'] > 5]
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
>>> data < 5
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
>>> data[data < 5] = 0
>>> data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
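The assignment above modifies data in place. As a hedged alternative, the sketch below uses `DataFrame.where` to build a replaced copy while leaving the original intact, and repeats the row-versus-column distinction of `[]`. The names `first_two_rows`, `two_columns`, and `replaced` are made up for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A slice inside [] selects rows; a list of labels selects columns.
first_two_rows = data[:2]
two_columns = data[['one', 'two']]

# where keeps values where the condition holds and substitutes 0 elsewhere,
# returning a new DataFrame instead of mutating data.
replaced = data.where(data >= 5, 0)
print(replaced)
```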
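Selection with loc and iloc
The examples below use data as originally constructed, before the assignment data[data < 5] = 0.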
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
>>> data.loc['Colorado', ['two', 'three']]
two 5
three 6
Name: Colorado, dtype: int64
>>> data.iloc[2, [3, 0, 1]]
four 11
one 8
two 9
Name: Utah, dtype: int64
>>> data.iloc[2]
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
>>> data.iloc[[1, 2], [3, 0, 1]]
four one two
Colorado 7 4 5
Utah 11 8 9
>>> data.loc[:'Utah', 'two']
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int64
>>> data.iloc[:, :3][data.three > 5]
one two three
Colorado 4 5 6
Utah 8 9 10
New York 12 13 14
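Type | Description |
---|---|
df[val] | Select a single column or a set of columns from the DataFrame; convenient special cases: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on a condition) |
df.loc[val] | Select a single row or a subset of rows by label |
df.loc[:, val] | Select a single column or a subset of columns by label |
df.iloc[where] | Select a single row or a subset of rows by integer position |
df.iloc[:, where] | Select a single column or a subset of columns by integer position |
df.iloc[where_i, where_j] | Select both rows and columns by integer position |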
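The forms in the table can be checked against each other on a small frame. This is only a sketch; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

row = data.loc['Utah']                      # df.loc[val]
col_subset = data.loc[:, ['one', 'three']]  # df.loc[:, val]
row_by_pos = data.iloc[2]                   # df.iloc[where]
col_by_pos = data.iloc[:, [0, 2]]           # df.iloc[:, where]
cell = data.iloc[2, 0]                      # df.iloc[where_i, where_j]

# The label-based and position-based selections pick the same data here.
print(row.equals(row_by_pos), col_subset.equals(col_by_pos), cell)
```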
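Arithmetic and data alignment
Data alignment
One of the most important pandas features is that it can perform arithmetic between objects with different indexes. When objects are added, if any index pairs are not the same, the index of the result is the union of the index pairs.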
>>> s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
>>> s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
...                index=['a', 'c', 'e', 'f', 'g'])
>>> s1
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
>>> s2
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
Adding them:
>>> s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
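Where the indexes do not align, NA values are introduced. For a DataFrame, alignment happens on both the rows and the columns: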
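The union behavior for Series can be verified directly. The sketch below compares the result index with `Index.union` and then keeps only the labels present in both operands; `total`, `expected_index`, and `overlap_only` are names invented for the example.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2

# The result is indexed by the union of both indexes.
expected_index = s1.index.union(s2.index)
print(total.index.equals(expected_index))

# Dropping NaN leaves only the labels that appear in both Series.
overlap_only = total.dropna()
print(overlap_only)
```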
>>> df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
...                    index=['Ohio', 'Texas', 'Colorado'])
>>> df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
...                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
>>> df1
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
>>> df2
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
>>> df1 + df2
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
>>> df1 = pd.DataFrame({'A': [1, 2]})
>>> df2 = pd.DataFrame({'B': [3, 4]})
>>> df1
A
0 1
1 2
>>> df2
B
0 3
1 4
>>> df1 + df2
A B
0 NaN NaN
1 NaN NaN
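Filling NA values in arithmetic
In arithmetic between differently indexed objects, you may want to fill with a special value, such as 0, when a label is found in one object but not in the other: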
>>> df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
>>> df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
>>> df2.loc[1, 'b'] = np.nan
>>> df1
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
>>> df2
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
>>> df1 + df2
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
>>> df1.add(df2, fill_value=0)
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
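Each method has a counterpart starting with the letter r, which flips the arguments.
Method | Description |
---|---|
add, radd | Addition (+) |
sub, rsub | Subtraction (-) |
div, rdiv | Division (/) |
floordiv, rfloordiv | Floor division (//) |
mul, rmul | Multiplication (*) |
pow, rpow | Exponentiation (**) |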
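A short sketch of what "flipped arguments" means in practice: `df.rdiv(1)` computes `1 / df`, and `df.rsub(10)` computes `10 - df`. The frame below starts at 1 rather than 0 only to keep the example free of division by zero; the names `direct` and `flipped` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1., 13.).reshape((3, 4)), columns=list('abcd'))

# rdiv flips the operands: df1.rdiv(1) is the same as 1 / df1.
direct = 1 / df1
flipped = df1.rdiv(1)
print(direct.equals(flipped))

# The same holds for the other r-methods, for example rsub.
print(df1.rsub(10).equals(10 - df1))
```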
Operations between DataFrame and Series
>>> frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
...                      columns=list('bde'),
...                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])
>>> series = frame.iloc[0]
>>> frame
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
>>> series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
>>> frame - series
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.
>>> series2 = pd.Series(range(3), index=['b', 'e', 'f'])
>>> series2
b 0
e 1
f 2
dtype: int64
>>> frame + series2
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
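If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union.
To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods.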
>>> frame
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
>>> series3 = frame['d']
>>> series3
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
>>> frame.sub(series3, axis='index')
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
The axis you pass is the axis to match on. In this example, the goal is to match on the DataFrame's row index (axis='index' or axis=0) and broadcast across the columns.
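A common use of row matching is subtracting each row's mean. The sketch below does this with `sub(..., axis='index')` and, for contrast, shows that the plain `-` operator tries to match the Series against the columns and so produces only NaN. The names `row_means`, `centered`, and `misaligned` are made up for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels.
row_means = frame.mean(axis='columns')

# Match on the row index and broadcast across the columns.
centered = frame.sub(row_means, axis='index')
print(centered)

# The operator form matches the Series index against the columns instead,
# so no labels line up and every value is NaN.
misaligned = frame - row_means
print(misaligned.isna().all().all())
```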
Function application
apply
apply works on a DataFrame and can operate on its rows or columns:
>>> frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
...                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])
>>> frame
b d e
Utah 0.472846 0.209272 0.675234
Ohio -0.625643 -0.693833 1.309783
Texas 1.093068 0.654257 0.721414
Oregon 0.544472 -1.931233 -1.925308
>>> f = lambda x: x.max() - x.min()
>>> frame.apply(f)
b 1.718711
d 2.585490
e 3.235091
dtype: float64
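Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column of frame. The result is a Series with the columns of frame as its index.
If you pass axis='columns' or axis=1 to apply, the function is invoked once per row instead: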
>>> frame.apply(f, axis='columns')
Utah 0.465962
Ohio 2.003616
Texas 0.438811
Oregon 2.475704
dtype: float64
The function passed to apply need not return a scalar; it can also return a Series with multiple values:
>>> def f(x):
...     return pd.Series([x.min(), x.max()], index=['min', 'max'])
>>> frame.apply(f)
b d e
min -0.625643 -1.931233 -1.925308
max 1.093068 0.654257 1.309783
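For simple reductions such as this range calculation, the built-in DataFrame methods give the same result as apply without calling a Python function per column. The sketch below compares the two on the values printed above (hard-coded here, since the original frame is random); `via_apply` and `vectorized` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame([[0.472846, 0.209272, 0.675234],
                      [-0.625643, -0.693833, 1.309783],
                      [1.093068, 0.654257, 0.721414],
                      [0.544472, -1.931233, -1.925308]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Column-wise range via apply.
via_apply = frame.apply(lambda x: x.max() - x.min())

# The same computation with built-in reductions.
vectorized = frame.max() - frame.min()

print(via_apply)
print(vectorized)
```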
applymap
To apply a function to every element of a DataFrame, use applymap:
>>> frame
b d e
Utah 0.472846 0.209272 0.675234
Ohio -0.625643 -0.693833 1.309783
Texas 1.093068 0.654257 0.721414
Oregon 0.544472 -1.931233 -1.925308
>>> format = lambda x: '%.2f' % x
>>> frame.applymap(format)
b d e
Utah 0.47 0.21 0.68
Ohio -0.63 -0.69 1.31
Texas 1.09 0.65 0.72
Oregon 0.54 -1.93 -1.93
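To the best of my knowledge, pandas 2.1 and later deprecate applymap in favor of an equivalent `DataFrame.map` method. The following sketch picks whichever method the installed version provides; the helper name `two_decimals` and the variable `elementwise` are assumptions made for this example, which also avoids shadowing the built-in `format`.

```python
import pandas as pd

frame = pd.DataFrame([[0.472846, 0.209272, 0.675234],
                      [-0.625643, -0.693833, 1.309783]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio'])


def two_decimals(x):
    """Format a number with two digits after the decimal point."""
    return f'{x:.2f}'


# Prefer DataFrame.map when it exists; otherwise fall back to applymap.
elementwise = getattr(frame, 'map', None) or frame.applymap
formatted = elementwise(two_decimals)
print(formatted)
```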
map
map mainly works on each element of a Series; it is used as follows:
>>> frame['e'].map(format)
Utah 0.68
Ohio 1.31
Texas 0.72
Oregon -1.93
Name: e, dtype: object
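Besides a function, `Series.map` also accepts a dict, in which case each element is looked up as a key and values without a match become NaN. A minimal sketch with made-up data:

```python
import pandas as pd

codes = pd.Series(['a', 'b', 'c', 'a'])

# Each element is looked up in the dict; 'c' has no entry and becomes NaN.
names = codes.map({'a': 'alpha', 'b': 'beta'})
print(names)
```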
Sorting
sort_index
>>> obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
>>> obj
d 0
a 1
b 2
c 3
dtype: int64
>>> obj.sort_index()
a 1
b 2
c 3
d 0
dtype: int64
>>> frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
...                      index=['three', 'one'],
...                      columns=['d', 'a', 'b', 'c'])
>>> frame
d a b c
three 0 1 2 3
one 4 5 6 7
>>> frame.sort_index()
d a b c
one 4 5 6 7
three 0 1 2 3
>>> frame.sort_index(axis=1)
a b c d
three 1 2 3 0
one 5 6 7 4
>>> frame.sort_index(axis=1, ascending=False)
d c b a
three 0 3 2 1
one 4 7 6 5
sort_values
>>> obj = pd.Series([4, 7, -3, 2])
>>> obj.sort_values()
2 -3
3 2
0 4
1 7
dtype: int64
>>> frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
>>> frame
b a
0 4 0
1 7 1
2 -3 0
3 2 1
>>> frame.sort_values(by='b')
b a
2 -3 0
3 2 1
0 4 0
1 7 1
>>> frame.sort_values(by=['a', 'b'])
b a
2 -3 0
0 4 0
3 2 1
1 7 1
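Two sort_values options not shown above are descending order and the placement of missing values. In the sketch below, NaN is sorted to the end by default in both directions, and `na_position='first'` moves it to the front; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Ascending sort: missing values go to the end by default.
asc = obj.sort_values()

# Descending sort: missing values still go to the end.
desc = obj.sort_values(ascending=False)

# na_position='first' places missing values at the start instead.
na_first = obj.sort_values(na_position='first')

print(asc)
print(desc)
print(na_first)
```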
Summarizing and computing descriptive statistics
>>> df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
...                    [np.nan, np.nan], [0.75, -1.3]],
...                   index=['a', 'b', 'c', 'd'],
...                   columns=['one', 'two'])
>>> df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
>>> df.sum()
one 9.25
two -5.80
dtype: float64
Passing axis='columns' or axis=1 sums across the columns, giving one result per row:
>>> df.sum(axis=1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
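NA values are skipped by default. If an entire slice (a row or column here) is NA, mean returns NaN for it, while sum returns 0. The skipna option disables the skipping: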
>>> df.mean(axis='columns')
a 1.400
b 1.300
c NaN
d -0.275
dtype: float64
>>> df.mean(axis='columns', skipna=False)
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
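If an all-NA row should sum to NaN rather than 0, the `min_count` argument of sum can be used. The sketch below is an illustration under that assumption about the desired behavior; `default_sum` and `strict_sum` are made-up names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default: NA values are skipped, so the all-NA row 'c' sums to 0.
default_sum = df.sum(axis=1)

# min_count=1 requires at least one valid value, otherwise the result is NaN.
strict_sum = df.sum(axis=1, min_count=1)

print(default_sum)
print(strict_sum)
```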
Unique values: unique
>>> obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
>>> uniques = obj.unique()
>>> uniques
array(['c', 'a', 'd', 'b'], dtype=object)
Value counts: value_counts
>>> obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
>>> obj.value_counts()
a 3
c 3
b 2
d 1
dtype: int64
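value_counts is also available as a top-level pandas function that works on any array or sequence (recent pandas versions deprecate pd.value_counts in favor of the Series method):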
>>> pd.value_counts(obj.values, sort=False)
c 3
b 2
d 1
a 3
dtype: int64
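The same counts can be obtained through the Series method alone, which avoids the top-level function. The sketch below also shows `normalize=True`, which returns relative frequencies; `counts`, `shares`, and `from_array` are names chosen for the example.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Absolute counts, sorted in descending order by default.
counts = obj.value_counts()

# Relative frequencies that sum to 1.
shares = obj.value_counts(normalize=True)

# Wrapping a raw array in a Series gives access to the same method.
from_array = pd.Series(obj.values).value_counts(sort=False)

print(counts)
print(shares)
print(from_array)
```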
Membership: isin
>>> obj
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
>>> mask = obj.isin(['b', 'c'])
>>> mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
>>> obj[mask]
0 c
5 b
6 b
7 c
8 c
dtype: object
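The boolean mask returned by isin can be inverted with `~` to select everything outside the given set. A minimal sketch, with `others` as an illustrative name:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

mask = obj.isin(['b', 'c'])

# ~ negates the mask, keeping values that are neither 'b' nor 'c'.
others = obj[~mask]
print(others)
```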