Basic Functionality

Reindexing

Reindexing a Series

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

The reindex method rearranges the data according to the new index; if an index value is not already present, a missing value (NaN) is introduced.

>>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
>>> obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
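As a minimal sketch of two common variations, reindex also accepts a fill_value for labels that are missing and a method such as 'ffill' for ordered indexes. The colors Series and its values are made up for illustration.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Labels missing from the original index get fill_value instead of NaN.
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

# method='ffill' needs a monotonic index, so this uses an ordered integer index.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
forward = colors.reindex(range(6), method='ffill')

print(filled)
print(forward)
```

Forward filling carries the last known value into each gap, which is mostly useful for ordered data such as time series.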

Reindexing a DataFrame

>>> frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                         columns=['Ohio', 'Texas', 'California'])
>>> frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

Reindexing columns

>>> states = ['Texas', 'Utah', 'California']
>>> frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Dropping Entries by Index

Series drop

>>> obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
>>> obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
    
>>> new_obj = obj.drop('c')
>>> new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
    
>>> obj.drop(['d', 'c'])
a    0.0
b    1.0
e    4.0
dtype: float64

caution

The drop method returns a new object with the specified entries removed; the original object is left unchanged.

DataFrame drop

>>> data = pd.DataFrame(np.arange(16).reshape((4, 4)), 
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
>>> data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

Pass axis=1 or axis='columns' to drop columns instead:

>>> data.drop('two', axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
>>> data.drop(['two', 'four'], axis='columns')
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

info

Passing inplace=True performs the drop directly on the original object:

>>> obj.drop('c', inplace=True)
>>> obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

danger

Use inplace with care: the dropped data is removed from the original object and cannot be recovered from it.
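The difference between the two modes can be sketched as follows. Keeping a copy before an in-place drop is just one possible safeguard, and the names trimmed and backup are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# drop returns a new object; at this point obj still has all five entries.
trimmed = obj.drop('c')

# Keep a copy if the original values may be needed after an in-place drop.
backup = obj.copy()
obj.drop('c', inplace=True)

print(list(obj.index))     # 'c' is gone from obj itself
print(list(backup.index))  # the copy still has it
```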

Indexing, Selection, and Filtering

Indexing

>>> obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
>>> obj
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
    
>>> obj['b']
1.0

>>> obj[1]
1.0

>>> obj[2:4]
c    2.0
d    3.0
dtype: float64
    
>>> obj[['b', 'a', 'd']]
b    1.0
a    0.0
d    3.0
dtype: float64
    
>>> obj[[1, 3]]
b    1.0
d    3.0
dtype: float64
    
>>> obj[obj < 2]
a    0.0
b    1.0
dtype: float64

Slicing with index labels

>>> obj['b':'c']
b    1.0
c    2.0
dtype: float64

caution

Slicing with index labels differs from ordinary Python slicing: the endpoint is inclusive.

Assigning to a label slice sets the corresponding section of the Series:

>>> obj['b':'c'] = 5
>>> obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
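To make the inclusive and exclusive endpoints explicit, the sketch below uses loc for labels and iloc for positions. It also avoids plain integer keys such as obj[1] on a label-indexed Series, which recent pandas releases flag with a deprecation warning.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: 'c' is included
by_position = obj.iloc[1:2]    # positional slice: position 2 is excluded
second = obj.iloc[1]           # explicit positional lookup

print(by_label)
print(by_position)
print(second)
```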

DataFrame indexing

>>> data = pd.DataFrame(np.arange(16).reshape((4, 4)), 
                        index=['Ohio', 'Colorado', 'Utah', 'New York'],
                        columns=['one', 'two', 'three', 'four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

>>> data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
        
>>> data[['three', 'one']]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

Special cases

Slicing a DataFrame with data[:2] selects rows, not columns. Only a single label or a list of labels passed to [] selects columns.

>>> data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

Selecting data with booleans

>>> data[data['three'] > 5]
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

>>> data < 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

>>> data[data < 5] = 0
>>> data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

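Selection with loc and iloc

The examples below use data with its original values, as it was before the data[data < 5] = 0 assignment above:

>>> data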
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

loc

>>> data.loc['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int64

iloc

>>> data.iloc[2, [3, 0, 1]]
four    11
one      8
two      9
Name: Utah, dtype: int64
        
>>> data.iloc[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
        
>>> data.iloc[[1, 2], [3, 0, 1]]
          four  one  two
Colorado     7    4    5
Utah        11    8    9

Both indexing operators also work with slices, in addition to single labels or lists of labels:

>>> data.loc[:'Utah', 'two']
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64
        
>>> data.iloc[:, :3][data.three > 5]
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14

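Type	Description
df[val]	Select a single column or a set of columns from a DataFrame; also convenient in special cases: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on a condition)
df.loc[val]	Select a single row or a subset of rows from the DataFrame by label
df.loc[:, val]	Select a single column or a subset of columns by label
df.iloc[where]	Select a single row or a subset of rows from the DataFrame by integer position
df.iloc[:, where]	Select a single column or a subset of columns by integer position
df.iloc[where_i, where_j]	Select both rows and columns by integer position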
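The rows of the table can be exercised one by one on the data frame used earlier. This is only a compact sketch, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

col = data['two']                    # df[val]: single column
row = data.loc['Utah']               # df.loc[val]: single row by label
cols = data.loc[:, ['one', 'four']]  # df.loc[:, val]: column subset by label
row_i = data.iloc[2]                 # df.iloc[where]: single row by position
cols_i = data.iloc[:, [0, 3]]        # df.iloc[:, where]: columns by position
cell = data.iloc[2, 3]               # df.iloc[where_i, where_j]: row and column

print(row.equals(row_i), cols.equals(cols_i), cell)
```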

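Arithmetic and Data Alignment

Data alignment

One of the most important pandas features is that it supports arithmetic between objects with different indexes. When you add objects together, if any index pairs are not the same, the index of the result is the union of the index pairs.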

>>> s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
>>> s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], 
                   index=['a', 'c', 'e', 'f', 'g'])
>>> s1
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
    
>>> s2
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

Adding them:

>>> s1 + s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

info

Where index labels do not align, NA values are introduced.

DataFrame

>>> df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), 
                       index=['Ohio', 'Texas', 'Colorado'])
>>> df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), 
                       index=['Utah', 'Ohio', 'Texas', 'Oregon'])
>>> df1
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
>>> df2
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

>>> df1 + df2
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

If you add DataFrame objects that have no column or row labels in common, the result contains only NA values:

>>> df1 = pd.DataFrame({'A': [1, 2]})
>>> df2 = pd.DataFrame({'B': [3, 4]})
>>> df1
   A
0  1
1  2
>>> df2
   B
0  3
1  4
>>> df1 + df2
    A   B
0 NaN NaN
1 NaN NaN

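Filling NA values in arithmetic

In arithmetic between differently indexed objects, you may want to fill with a specific value (such as 0) when a label is found in one object but not the other: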

>>> df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
>>> df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
>>> df2.loc[1, 'b'] = np.nan

>>> df1
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

>>> df2
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

Adding them produces NA values wherever the indexes do not align:

>>> df1 + df2
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

Use the add method and pass the fill_value argument:

>>> df1.add(df2, fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

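Common arithmetic methods

Methods whose names start with the letter r have their arguments reversed.

Method	Description
add, radd	Addition (+)
sub, rsub	Subtraction (-)
div, rdiv	Division (/)
floordiv, rfloordiv	Floor division (//)
mul, rmul	Multiplication (*)
pow, rpow	Exponentiation (**)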
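A short sketch of what "arguments reversed" means in practice; the small df1 here is made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list('abc'))

# df1.rdiv(1) computes 1 / df1: the operands are swapped relative to div.
reciprocal = df1.rdiv(1)

# df1.sub(10) is df1 - 10, while df1.rsub(10) is 10 - df1.
diff = df1.sub(10)
rdiff = df1.rsub(10)

print(reciprocal)
print(rdiff)
```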

Operations between DataFrame and Series

>>> frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                         columns=list('bde'),
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])
>>> series = frame.iloc[0]
>>> frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

>>> series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
        
>>> frame - series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

info

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.

>>> series2 = pd.Series(range(3), index=['b', 'e', 'f'])
>>> series2
b    0
e    1
f    2
dtype: int64

>>> frame + series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN

info

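If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union.

If you want to match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods.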

>>> frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

>>> series3 = frame['d']
>>> series3
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

>>> frame.sub(series3, axis='index')
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

info

The axis you pass is the axis to match on. In this example, the goal is to match on the DataFrame's row index (axis='index' or axis=0) and broadcast across the columns.
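One typical use of matching on the row index is subtracting a per-row statistic. The sketch below centers each row on its own mean; the names row_means and centered are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Row means are indexed by the row labels, so they must be matched on axis='index'.
row_means = frame.mean(axis='columns')
centered = frame.sub(row_means, axis='index')

print(centered)
```

Writing frame - row_means instead would try to match the row labels against the column labels b, d, and e and produce only NaN values.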

Function Application

apply

apply works on a DataFrame and can operate on its rows or columns:

>>> frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), 
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])
>>> frame
               b         d         e
Utah    0.472846  0.209272  0.675234
Ohio   -0.625643 -0.693833  1.309783
Texas   1.093068  0.654257  0.721414
Oregon  0.544472 -1.931233 -1.925308

>>> f = lambda x: x.max() - x.min()

>>> frame.apply(f)
b    1.718711
d    2.585490
e    3.235091
dtype: float64

info

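Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column of frame. The result is a Series with the columns of frame as its index.

If you pass axis='columns' or axis=1 to apply, the function is invoked once per row instead: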

>>> frame.apply(f, axis='columns')
Utah      0.465962
Ohio      2.003616
Texas     0.438811
Oregon    2.475704
dtype: float64
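For simple reductions like this one, the built-in DataFrame methods give the same result without apply. The following sketch compares the two; the seed is arbitrary and only makes the random data reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

f = lambda x: x.max() - x.min()

via_apply = frame.apply(f)                    # once per column
vectorized = frame.max() - frame.min()        # same result, no apply
row_ranges = frame.max(axis='columns') - frame.min(axis='columns')

print(np.allclose(via_apply, vectorized))
print(np.allclose(frame.apply(f, axis='columns'), row_ranges))
```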

The function passed to apply need not return a scalar; it can also return a Series with multiple values:

>>> def f(x):
        return pd.Series([x.min(), x.max()], index=['min', 'max'])

>>> frame.apply(f)
            b         d         e
min -0.625643 -1.931233 -1.925308
max  1.093068  0.654257  1.309783

applymap

To apply a function to every element of a DataFrame, use applymap:

>>> frame
               b         d         e
Utah    0.472846  0.209272  0.675234
Ohio   -0.625643 -0.693833  1.309783
Texas   1.093068  0.654257  0.721414
Oregon  0.544472 -1.931233 -1.925308

>>> format = lambda x: '%.2f' % x

>>> frame.applymap(format)
            b      d      e
Utah     0.47   0.21   0.68
Ohio    -0.63  -0.69   1.31
Texas    1.09   0.65   0.72
Oregon   0.54  -1.93  -1.93
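In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which does the same element-wise job. The sketch below picks whichever is available; the helper two_decimals and the small frame are illustrative, and the hasattr check assumes no column is itself named 'map'.

```python
import pandas as pd

frame = pd.DataFrame([[0.472846, -1.931233], [1.093068, 0.654257]],
                     columns=['b', 'd'], index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format a number with two decimal places."""
    return f'{x:.2f}'

# DataFrame.map exists in pandas 2.1 and later; fall back to applymap otherwise.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(two_decimals)

print(formatted)
```

Using a name like two_decimals also avoids shadowing Python's built-in format function.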

map

map mainly operates on each element of a Series; it is used as follows:

>>> frame['e'].map(format)
Utah       0.68
Ohio       1.31
Texas      0.72
Oregon    -1.93
Name: e, dtype: object

Sorting

sort_index

>>> obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
>>> obj
d    0
a    1
b    2
c    3
dtype: int64

>>> obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64

DataFrame

>>> frame = pd.DataFrame(np.arange(8).reshape((2, 4)), 
                         index=['three', 'one'], 
                         columns=['d', 'a', 'b', 'c'])
>>> frame
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

>>> frame.sort_index()
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

>>> frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

>>> frame.sort_index(axis=1, ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

sort_values

>>> obj = pd.Series([4, 7, -3, 2])

>>> obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64

DataFrame

>>> frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
>>> frame
   b  a
0  4  0
1  7  1
2 -3  0
3  2  1

>>> frame.sort_values(by='b')
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1

>>> frame.sort_values(by=['a', 'b'])
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
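The ascending argument shown for sort_index works for sort_values as well, and it can be given per column when sorting by several keys. A minimal sketch on the same frame:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Largest 'b' first.
descending = frame.sort_values(by='b', ascending=False)

# 'a' ascending, then 'b' descending within each value of 'a'.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(descending)
print(mixed)
```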

Summary and Statistics

>>> df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'],
                      columns=['one', 'two'])
>>> df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

>>> df.sum()
one    9.25
two   -5.80
dtype: float64

Passing axis='columns' or axis=1 sums across the columns of each row instead:

>>> df.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

info

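NA values are skipped by default unless the entire slice (a row or column here) is NA; in that case mean returns NaN, while sum returns 0 as shown above. Skipping can be disabled with the skipna option: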

>>> df.mean(axis='columns')
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

>>> df.mean(axis='columns', skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
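The same options apply to sum. As a sketch, the min_count argument can additionally be used when an all-NA row should give NaN rather than 0; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

row_sum = df.sum(axis='columns')                    # all-NaN row 'c' sums to 0.0
strict_sum = df.sum(axis='columns', skipna=False)   # any NaN in a row gives NaN
# min_count=1 requires at least one non-NaN value, so row 'c' becomes NaN.
guarded_sum = df.sum(axis='columns', min_count=1)

print(row_sum)
print(strict_sum)
print(guarded_sum)
```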

Unique values: unique

>>> obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
>>> uniques = obj.unique()
>>> uniques
array(['c', 'a', 'd', 'b'], dtype=object)

Value counts: value_counts

>>> obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
>>> obj.value_counts()
a    3
c    3
b    2
d    1
dtype: int64

tip

value_counts is also available as a top-level pandas function, which can be used with any array or sequence:

>>> pd.value_counts(obj.values, sort=False)
c    3
b    2
d    1
a    3
dtype: int64
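Recent pandas releases (2.1 and later) deprecate the top-level pd.value_counts, so a version-independent sketch wraps the values in a Series and calls the method instead. The normalize option, which returns relative frequencies, is shown as an extra.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = obj.value_counts()
shares = obj.value_counts(normalize=True)   # relative frequencies

# For a plain array or list, wrapping it in a Series gives access to the same method.
from_values = pd.Series(obj.to_numpy()).value_counts(sort=False)

print(counts)
print(shares)
```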

Membership: isin

>>> obj
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
    
>>> mask = obj.isin(['b', 'c'])
>>> mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
    
>>> obj[mask]
0    c
5    b
6    b
7    c
8    c
dtype: object
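The boolean mask returned by isin can also be inverted with ~ to keep everything that is not in the given set. A minimal sketch:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

mask = obj.isin(['b', 'c'])
kept = obj[mask]        # values that are 'b' or 'c'
dropped = obj[~mask]    # everything else

print(kept.tolist())
print(dropped.tolist())
```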
