
Data Cleaning

Handling Missing Data

Missing Data

In pandas, a missing value is referred to as NA, which stands for "not available". For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

>>> string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

>>> string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object

>>> string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
The built-in Python None value is also treated as NA in object arrays:
>>> string_data[0] = None

>>> string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool

Filtering with dropna

>>> from numpy import nan as NA
>>> data = pd.Series([1, NA, 3.5, NA, 7])

>>> data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64

>>> data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
This is equivalent to:
>>> data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
For a DataFrame, dropna by default drops any row containing a missing value:
>>> data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
>>> data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0

>>> cleaned = data.dropna()
>>> cleaned
0 1 2
0 1.0 6.5 3.0
Passing how='all' drops only the rows that are all NA:
>>> data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
To drop columns the same way, pass axis=1:
>>> data[4] = NA
>>> data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
>>> data.dropna(axis=1, how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
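dropna also accepts a thresh argument, which keeps only the rows having at least that many non-NA values. A minimal sketch reusing the DataFrame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                   [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Keep rows with at least 2 non-NA values: rows 0 and 3 survive,
# row 1 (one non-NA) and row 2 (all NA) are dropped
kept = df.dropna(thresh=2)
```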

Filling with fillna

>>> df
0 1 2
0 0.718710 NaN NaN
1 0.219162 NaN NaN
2 -1.670341 NaN -1.951170
3 -0.362809 NaN -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278

>>> df.fillna(0)
0 1 2
0 0.718710 0.000000 0.000000
1 0.219162 0.000000 0.000000
2 -1.670341 0.000000 -1.951170
3 -0.362809 0.000000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
Calling fillna with a dict fills a different value for each column:
>>> df.fillna({1: 0.5, 2: 0})
0 1 2
0 0.718710 0.500000 0.000000
1 0.219162 0.500000 0.000000
2 -1.670341 0.500000 -1.951170
3 -0.362809 0.500000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
fillna can also modify the existing object in place:
>>> df.fillna(0, inplace=True)
>>> df
0 1 2
0 0.718710 0.000000 0.000000
1 0.219162 0.000000 0.000000
2 -1.670341 0.000000 -1.951170
3 -0.362809 0.000000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
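Besides filling with a constant, missing values can be filled forward from the last valid observation; in recent pandas versions this is spelled with the ffill method (the older fillna(method='ffill') form is deprecated). A sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward-fill propagates the last valid observation forward
filled = s.ffill()

# limit caps the number of consecutive NAs that get filled
limited = s.ffill(limit=1)
```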

Data Transformation

Removing Duplicates

>>> data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
...                      'k2': [1, 1, 2, 3, 3, 4, 4]})
>>> data
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
duplicated returns a boolean Series indicating whether each row is a duplicate of an earlier one:
>>> data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
drop_duplicates considers all of the columns by default:
>>> data.drop_duplicates()
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
drop_duplicates can detect duplicates based on a subset of the columns:
>>> data['v1'] = range(7)
>>> data
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6

>>> data.drop_duplicates(['k1'])
k1 k2 v1
0 one 1 0
1 two 1 1
tip

drop_duplicates keeps the first observed value combination by default. Pass keep='last' to keep the last one.

>>> data.drop_duplicates(['k1', 'k2'], keep='last')
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6

Mapping with Functions

>>> data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
...                               'Pastrami', 'corned beef', 'Bacon',
...                               'pastrami', 'honey ham', 'nova lox'],
...                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
>>> data
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0

A mapping from each meat type to the kind of animal:

>>> meat_to_animal = {
... 'bacon': 'pig',
... 'pulled pork': 'pig',
... 'pastrami': 'cow',
... 'corned beef': 'cow',
... 'honey ham': 'pig',
... 'nova lox': 'salmon'
... }
Normalize the case first:
>>> lowercased = data['food'].str.lower()
>>> lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
>>> data['animal'] = lowercased.map(meat_to_animal)
>>> data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
lambda
>>> data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object

Replacing with replace

>>> data = pd.Series([1., -999., 2., -999., -1000., 3.])
>>> data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
Replace -999:
>>> data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
Replace multiple values at once by passing a list:
>>> data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
>>> data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
A dict is equivalent:
>>> data.replace({-999: np.nan, -1000: 0})

Renaming Axis Indexes

>>> data = pd.DataFrame(np.arange(12).reshape((3, 4)),
...                     index=['Ohio', 'Colorado', 'New York'],
...                     columns=['one', 'two', 'three', 'four'])
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11

>>> transform = lambda x: x[:4].upper()

>>> data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
Assign back to modify the DataFrame in place:
>>> data.index = data.index.map(transform)
>>> data
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
rename
>>> data.rename(index=str.title, columns=str.upper)
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
rename can also be combined with a dict to update a subset of the axis labels:
>>> data.rename(index={'OHIO': 'INDIANA'},
...             columns={'three': 'peekaboo'})
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
rename can modify the object in place:
>>> data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
>>> data
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11

Discretization

>>> ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Divide the ages into bins:
>>> bins = [18, 25, 35, 60, 100]
>>> cats = pd.cut(ages, bins)
>>> cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
Inspect the bin codes and categories, and tally bin membership:
>>> cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

>>> cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')

>>> pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Change which side is closed with right=False:
>>> pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
Set your own bin names by passing labels:
>>> group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
>>> pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
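A close cousin of cut is qcut, which bins by sample quantiles instead of explicit edges, so each bin holds roughly the same number of observations. A sketch using the same ages:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# qcut picks bin edges from the sample quantiles; with 4 bins (quartiles),
# the 12 ages split into groups of 3
quartiles = pd.qcut(ages, 4)
counts = pd.Series(quartiles).value_counts()
```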

String Manipulation

String Object Methods

For many string munging and scripting applications, the built-in string methods are sufficient. For example, a comma-separated string can be broken into pieces with split:

>>> val = 'a,b,  guido'
>>> val.split(',')
['a', 'b', '  guido']
tip

split is often combined with strip to trim whitespace (including newlines):

>>> pieces = [x.strip() for x in val.split(',')]
>>> pieces
['a', 'b', 'guido']

These substrings could be concatenated with a two-colon delimiter using addition:

>>> first, second, third = pieces
>>> first + '::' + second + '::' + third
'a::b::guido'
join
>>> '::'.join(pieces)
'a::b::guido'

The best way to detect a substring is with Python's in keyword, though index and find can also be used:

>>> val
'a,b,  guido'

>>> 'guido' in val
True
>>> val.index(',')
1
>>> val.find(':')
-1
caution

Note the difference between find and index: index raises an exception if the string isn't found (rather than returning -1).

>>> val.index(':')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: substring not found

count returns the number of occurrences of a particular substring:

>>> val.count(',')
2

replace substitutes one pattern for another; passing an empty string effectively deletes the pattern:

>>> val.replace(',', '::')
'a::b:: guido'
>>> val.replace(',', '')
'ab guido'

Regular Expressions

Suppose we want to split a string whose delimiter is a variable number of whitespace characters (tabs, spaces, newlines). The regex describing one or more whitespace characters is \s+:

>>> import re
>>> text = "foo bar\t baz \tqux"
>>> re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
regex
>>> regex = re.compile(r'\s+')
>>> regex.split(text)
['foo', 'bar', 'baz', 'qux']

If you want a list of all patterns matching the regex instead, use the findall method:

>>> regex.findall(text)
[' ', '\t ', ' \t']
info

match, search, and findall do similar things. findall returns all matches in the string, while search returns only the first one. match is stricter still: it matches only at the beginning of the string.

>>> text = """Dave dave@google.com
... Steve steve@gmail.com
... Rob rob@gmail.com
... Ryan ryan@yahoo.com
... """
>>> text
'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'

>>> pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
note

Using findall on text produces a list of the e-mail addresses:

>>> regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
note

search returns the first e-mail address in the text (as a special match object). For the above regex, the match object can tell us the start and end position of the pattern in the original string:

>>> m = regex.search(text)
>>> m
<re.Match object; span=(5, 20), match='dave@google.com'>
>>> text[m.start():m.end()]
'dave@google.com'
>>> m.group()
'dave@google.com'
note

regex.match returns None, because it will match only if the pattern occurs at the start of the string:

>>> print(regex.match(text))
None
note

The regex.sub method replaces occurrences of the pattern with a given string, returning the new string:

>>> print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
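sub also gives the replacement string access to the capture groups via \1, \2, and so on. A small sketch using the same e-mail pattern:

```python
import re

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
text = "Dave dave@google.com\nSteve steve@gmail.com\n"

# \1, \2, \3 refer to the username, domain, and suffix groups
result = regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)
```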

Vectorized String Functions

Cleaning up a messy dataset for analysis often requires a lot of string regularization. To complicate matters, a column containing strings will sometimes have missing data:

>>> data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
...         'Rob': 'rob@gmail.com', 'Wes': np.nan}
>>> data = pd.Series(data)
>>> data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object

>>> data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool

Using data.map, all string and regular expression methods can be applied to each value (by passing a lambda or other function), but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute.

note

For example, check whether each e-mail address contains "gmail" with str.contains:

>>> data.str.contains('gmail')
Dave False
Steve True
Rob True
Wes NaN
dtype: object
note

Regular expressions can be used, too:

>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
note

Strings can be sliced with this syntax:

>>> data.str[:5]
Dave dave@
Steve steve
Rob rob@g
Wes NaN
dtype: object

The str.cat method concatenates two columns:

>>> data = [['Ali', 'Azmat', '30'],
...         ['Sharukh', 'Khan', '40'],
...         ['Linus', 'Torvalds', '70']]
>>> df = pd.DataFrame(data, columns=['First', 'Last', 'Age'])
>>> df
First Last Age
0 Ali Azmat 30
1 Sharukh Khan 40
2 Linus Torvalds 70

>>> df['Full Name'] = df['First'].str.cat(df['Last'],sep=" ")
>>> df
First Last Age Full Name
0 Ali Azmat 30 Ali Azmat
1 Sharukh Khan 40 Sharukh Khan
2 Linus Torvalds 70 Linus Torvalds
Summary of vectorized string methods
Method        Description
cat           Concatenate strings element-wise with optional delimiter
contains      Return boolean array if each string contains pattern/regex
count         Count occurrences of pattern
extract       Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
endswith      Equivalent to x.endswith(pattern) for each element
startswith    Equivalent to x.startswith(pattern) for each element
findall       Compute list of all occurrences of pattern/regex for each string
get           Index into each element (retrieve i-th element)
isalnum       Equivalent to built-in str.isalnum
isalpha       Equivalent to built-in str.isalpha
isdecimal     Equivalent to built-in str.isdecimal
isdigit       Equivalent to built-in str.isdigit
islower       Equivalent to built-in str.islower
isnumeric     Equivalent to built-in str.isnumeric
isupper       Equivalent to built-in str.isupper
join          Join strings in each element of the Series with passed separator
len           Compute length of each string
lower, upper  Convert cases; equivalent to x.lower() or x.upper() for each element
match         Use re.match with the passed regular expression on each element, returning matched groups as list
pad           Add whitespace to left, right, or both sides of strings
center        Equivalent to pad(side='both')
repeat        Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
replace       Replace occurrences of pattern/regex with some other string
slice         Slice each string in the Series
split         Split strings on delimiter or regular expression
strip         Trim whitespace from both sides, including newlines
rstrip        Trim whitespace on right side
lstrip        Trim whitespace on left side
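The extract method from the table returns a DataFrame with one column per capture group. A sketch using the e-mail data from above:

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Unnamed groups produce integer column labels 0, 1, 2;
# entries that are NA (or do not match) come back as NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
```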