
Data Cleaning

Handling Missing Data

Missing Data

In pandas, a missing value is referred to as NA, which stands for "not available". For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

>>> string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

>>> string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object

>>> string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
The built-in Python None value is also treated as NA in object arrays:
>>> string_data[0] = None

>>> string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool

Filtering with dropna

>>> from numpy import nan as NA
>>> data = pd.Series([1, NA, 3.5, NA, 7])

>>> data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64

>>> data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
This is equivalent to:
>>> data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
For a DataFrame, dropna by default drops any row containing a missing value:
>>> data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
>>> data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0

>>> cleaned = data.dropna()
>>> cleaned
0 1 2
0 1.0 6.5 3.0
Passing how='all' drops only the rows that are all NA:
>>> data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
To drop columns the same way, pass axis=1:
>>> data[4] = NA
>>> data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
>>> data.dropna(axis=1, how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
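dropna also accepts a thresh argument, which keeps only the rows having at least that many non-NA values. A minimal sketch reusing the DataFrame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                   [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Keep rows with at least 2 non-NA values: rows 0 and 3 survive,
# row 1 (one non-NA) and row 2 (all NA) are dropped
kept = df.dropna(thresh=2)
```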

Filling with fillna

>>> df
0 1 2
0 0.718710 NaN NaN
1 0.219162 NaN NaN
2 -1.670341 NaN -1.951170
3 -0.362809 NaN -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278

>>> df.fillna(0)
0 1 2
0 0.718710 0.000000 0.000000
1 0.219162 0.000000 0.000000
2 -1.670341 0.000000 -1.951170
3 -0.362809 0.000000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
Calling fillna with a dict fills a different value for each column:
>>> df.fillna({1: 0.5, 2: 0})
0 1 2
0 0.718710 0.500000 0.000000
1 0.219162 0.500000 0.000000
2 -1.670341 0.500000 -1.951170
3 -0.362809 0.500000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
fillna can also modify the existing object in place:
>>> df.fillna(0, inplace=True)
>>> df
0 1 2
0 0.718710 0.000000 0.000000
1 0.219162 0.000000 0.000000
2 -1.670341 0.000000 -1.951170
3 -0.362809 0.000000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
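Besides filling with a constant, missing values can be filled forward from the last valid observation; in recent pandas versions this is spelled with the ffill method (the older fillna(method='ffill') form is deprecated). A sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward-fill propagates the last valid observation forward
filled = s.ffill()

# limit caps the number of consecutive NAs that get filled
limited = s.ffill(limit=1)
```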

Data Transformation

Removing Duplicates

>>> data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
...                      'k2': [1, 1, 2, 3, 3, 4, 4]})
>>> data
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
duplicated returns a boolean Series indicating whether each row is a duplicate of an earlier one:
>>> data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
drop_duplicates considers all of the columns by default:
>>> data.drop_duplicates()
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
drop_duplicates can detect duplicates based on a subset of the columns:
>>> data['v1'] = range(7)
>>> data
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6

>>> data.drop_duplicates(['k1'])
k1 k2 v1
0 one 1 0
1 two 1 1
tip

drop_duplicates keeps the first observed value combination by default. Pass keep='last' to keep the last one.

>>> data.drop_duplicates(['k1', 'k2'], keep='last')
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6

Mapping with Functions

>>> data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
...                               'Pastrami', 'corned beef', 'Bacon',
...                               'pastrami', 'honey ham', 'nova lox'],
...                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
>>> data
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0

A mapping from each meat type to the kind of animal:

>>> meat_to_animal = {
... 'bacon': 'pig',
... 'pulled pork': 'pig',
... 'pastrami': 'cow',
... 'corned beef': 'cow',
... 'honey ham': 'pig',
... 'nova lox': 'salmon'
... }
Normalize the case first:
>>> lowercased = data['food'].str.lower()
>>> lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
>>> data['animal'] = lowercased.map(meat_to_animal)
>>> data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
lambda
>>> data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object

Replacing with replace

>>> data = pd.Series([1., -999., 2., -999., -1000., 3.])
>>> data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
Replace -999:
>>> data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
Replace multiple values at once by passing a list:
>>> data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
>>> data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
A dict is equivalent:
>>> data.replace({-999: np.nan, -1000: 0})

Renaming Axis Indexes

>>> data = pd.DataFrame(np.arange(12).reshape((3, 4)),
...                     index=['Ohio', 'Colorado', 'New York'],
...                     columns=['one', 'two', 'three', 'four'])
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11

>>> transform = lambda x: x[:4].upper()

>>> data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
Assign back to modify the DataFrame in place:
>>> data.index = data.index.map(transform)
>>> data
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
rename
>>> data.rename(index=str.title, columns=str.upper)
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
rename can also be combined with a dict to update a subset of the axis labels:
>>> data.rename(index={'OHIO': 'INDIANA'},
...             columns={'three': 'peekaboo'})
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
rename can modify the object in place:
>>> data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
>>> data
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11

Discretization

>>> ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Divide the ages into bins:
>>> bins = [18, 25, 35, 60, 100]
>>> cats = pd.cut(ages, bins)
>>> cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
Inspect the bin codes and categories, and tally bin membership:
>>> cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

>>> cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')

>>> pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Change which side is closed with right=False:
>>> pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
Set your own bin names by passing labels:
>>> group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
>>> pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
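A close cousin of cut is qcut, which bins by sample quantiles instead of explicit edges, so each bin holds roughly the same number of observations. A sketch using the same ages:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# qcut picks bin edges from the sample quantiles; with 4 bins (quartiles),
# the 12 ages split into groups of 3
quartiles = pd.qcut(ages, 4)
counts = pd.Series(quartiles).value_counts()
```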

String Manipulation

String Object Methods

For many string munging and scripting applications, the built-in string methods are sufficient. For example, a comma-separated string can be broken into pieces with split:

>>> val = 'a,b,  guido'
>>> val.split(',')
['a', 'b', '  guido']
tip

split is often combined with strip to trim whitespace (including newlines):

>>> pieces = [x.strip() for x in val.split(',')]
>>> pieces
['a', 'b', 'guido']

These substrings could be concatenated with a two-colon delimiter using addition:

>>> first, second, third = pieces
>>> first + '::' + second + '::' + third
'a::b::guido'
join
>>> '::'.join(pieces)
'a::b::guido'

The best way to detect a substring is with Python's in keyword, though index and find can also be used:

>>> val
'a,b,  guido'

>>> 'guido' in val
True
>>> val.index(',')
1
>>> val.find(':')
-1
caution

Note the difference between find and index: index raises an exception if the string isn't found (rather than returning -1).

>>> val.index(':')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: substring not found

count returns the number of occurrences of a particular substring:

>>> val.count(',')
2

replace substitutes one pattern for another; passing an empty string effectively deletes the pattern:

>>> val.replace(',', '::')
'a::b:: guido'
>>> val.replace(',', '')
'ab guido'

Regular Expressions

Suppose we want to split a string whose delimiter is a variable number of whitespace characters (tabs, spaces, newlines). The regex describing one or more whitespace characters is \s+:

>>> import re
>>> text = "foo bar\t baz \tqux"
>>> re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
regex
>>> regex = re.compile(r'\s+')
>>> regex.split(text)
['foo', 'bar', 'baz', 'qux']

If you want a list of all patterns matching the regex instead, use the findall method:

>>> regex.findall(text)
[' ', '\t ', ' \t']
info

match, search, and findall do similar things. findall returns all matches in the string, while search returns only the first one. match is stricter still: it matches only at the beginning of the string.

>>> text = """Dave dave@google.com
... Steve steve@gmail.com
... Rob rob@gmail.com
... Ryan ryan@yahoo.com
... """
>>> text
'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'

>>> pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
note

Using findall on text produces a list of the e-mail addresses:

>>> regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
note

search returns the first e-mail address in the text (as a special match object). For the above regex, the match object can tell us the start and end position of the pattern in the original string:

>>> m = regex.search(text)
>>> m
<re.Match object; span=(5, 20), match='dave@google.com'>
>>> text[m.start():m.end()]
'dave@google.com'
>>> m.group()
'dave@google.com'
note

regex.match returns None, because it will match only if the pattern occurs at the start of the string:

>>> print(regex.match(text))
None
note

The regex.sub method replaces occurrences of the pattern with a given string, returning the new string:

>>> print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
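sub also gives the replacement string access to the capture groups via \1, \2, and so on. A small sketch using the same e-mail pattern:

```python
import re

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
text = "Dave dave@google.com\nSteve steve@gmail.com\n"

# \1, \2, \3 refer to the username, domain, and suffix groups
result = regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)
```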

Vectorized String Functions

Cleaning up a messy dataset for analysis often requires a lot of string regularization. To complicate matters, a column containing strings will sometimes have missing data:

>>> data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
...         'Rob': 'rob@gmail.com', 'Wes': np.nan}
>>> data = pd.Series(data)
>>> data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object

>>> data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool

Using data.map, all string and regular expression methods can be applied to each value (by passing a lambda or other function), but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute.

note

For example, check whether each e-mail address contains "gmail" with str.contains:

>>> data.str.contains('gmail')
Dave False
Steve True
Rob True
Wes NaN
dtype: object
note

Regular expressions can be used, too:

>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
note

Strings can be sliced with this syntax:

>>> data.str[:5]
Dave dave@
Steve steve
Rob rob@g
Wes NaN
dtype: object

The str.cat method concatenates two columns:

>>> data = [['Ali', 'Azmat', '30'],
...         ['Sharukh', 'Khan', '40'],
...         ['Linus', 'Torvalds', '70']]
>>> df = pd.DataFrame(data, columns=['First', 'Last', 'Age'])
>>> df
First Last Age
0 Ali Azmat 30
1 Sharukh Khan 40
2 Linus Torvalds 70

>>> df['Full Name'] = df['First'].str.cat(df['Last'],sep=" ")
>>> df
First Last Age Full Name
0 Ali Azmat 30 Ali Azmat
1 Sharukh Khan 40 Sharukh Khan
2 Linus Torvalds 70 Linus Torvalds
Summary of vectorized string methods
Method        Description
cat           Concatenate strings element-wise with optional delimiter
contains      Return boolean array if each string contains pattern/regex
count         Count occurrences of pattern
extract       Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
endswith      Equivalent to x.endswith(pattern) for each element
startswith    Equivalent to x.startswith(pattern) for each element
findall       Compute list of all occurrences of pattern/regex for each string
get           Index into each element (retrieve i-th element)
isalnum       Equivalent to built-in str.isalnum
isalpha       Equivalent to built-in str.isalpha
isdecimal     Equivalent to built-in str.isdecimal
isdigit       Equivalent to built-in str.isdigit
islower       Equivalent to built-in str.islower
isnumeric     Equivalent to built-in str.isnumeric
isupper       Equivalent to built-in str.isupper
join          Join strings in each element of the Series with passed separator
len           Compute length of each string
lower, upper  Convert cases; equivalent to x.lower() or x.upper() for each element
match         Use re.match with the passed regular expression on each element, returning matched groups as list
pad           Add whitespace to left, right, or both sides of strings
center        Equivalent to pad(side='both')
repeat        Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
replace       Replace occurrences of pattern/regex with some other string
slice         Slice each string in the Series
split         Split strings on delimiter or regular expression
strip         Trim whitespace from both sides, including newlines
rstrip        Trim whitespace on right side
lstrip        Trim whitespace on left side
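The extract method from the table returns a DataFrame with one column per capture group. A sketch using the e-mail data from above:

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Unnamed groups produce integer column labels 0, 1, 2;
# entries that are NA (or do not match) come back as NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
```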