Data Cleaning
Handling Missing Data
Missing Data
In pandas, missing data is referred to as NA, which stands for "not available". For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values.
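As an added aside (not part of the original text): NaN is just a floating-point value, and it never compares equal to itself, which is why detection methods such as isnull are needed. The imports below are assumed throughout this section:
>>> import numpy as np
>>> import pandas as pd
>>> type(np.nan)
<class 'float'>
>>> np.nan == np.nan
False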
>>> string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
>>> string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
>>> string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
>>> string_data[0] = None
>>> string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
Filtering with dropna
>>> from numpy import nan as NA
>>> data = pd.Series([1, NA, 3.5, NA, 7])
>>> data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
>>> data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
>>> data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
>>> data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
>>> data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
>>> cleaned = data.dropna()
>>> cleaned
0 1 2
0 1.0 6.5 3.0
>>> data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
>>> data[4] = NA
>>> data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
>>> data.dropna(axis=1, how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
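dropna also takes a thresh argument, keeping only the rows that contain at least that many non-NA values (a small addition, using the data above; thresh is standard dropna API):
>>> data.dropna(thresh=2)
0 1 2 4
0 1.0 6.5 3.0 NaN
3 NaN 6.5 3.0 NaN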
Filling with fillna
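The df used below is not constructed in this excerpt; one way to build a frame with the same missing-value pattern (the values are random, so the numbers will differ) is:
>>> df = pd.DataFrame(np.random.randn(7, 3))
>>> df.iloc[:4, 1] = NA  # column 1 missing in the first four rows
>>> df.iloc[:2, 2] = NA  # column 2 missing in the first two rows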
>>> df
0 1 2
0 0.718710 NaN NaN
1 0.219162 NaN NaN
2 -1.670341 NaN -1.951170
3 -0.362809 NaN -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
>>> df.fillna(0)
0 1 2
0 0.718710 0.000000 0.000000
1 0.219162 0.000000 0.000000
2 -1.670341 0.000000 -1.951170
3 -0.362809 0.000000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
>>> df.fillna({1: 0.5, 2: 0})
0 1 2
0 0.718710 0.500000 0.000000
1 0.219162 0.500000 0.000000
2 -1.670341 0.500000 -1.951170
3 -0.362809 0.500000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
>>> df.fillna(0, inplace=True)
>>> df
0 1 2
0 0.718710 0.000000 0.000000
1 0.219162 0.000000 0.000000
2 -1.670341 0.000000 -1.951170
3 -0.362809 0.000000 -1.309137
4 -1.006110 -0.164043 -0.892337
5 1.226407 1.449656 1.484279
6 -0.523763 1.484114 -0.207278
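fillna is not limited to constants: passing method='ffill' propagates the last valid value forward, and passing a statistic such as the mean is a simple imputation strategy. A small sketch (not from the original; method='ffill' reflects the pandas version used throughout these notes):
>>> s = pd.Series([1., NA, NA, 4.])
>>> s.fillna(method='ffill')
0 1.0
1 1.0
2 1.0
3 4.0
dtype: float64
>>> s.fillna(s.mean())
0 1.0
1 2.5
2 2.5
3 4.0
dtype: float64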
Data Transformation
Removing Duplicate Data
>>> data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
>>> data
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
>>> data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
>>> data.drop_duplicates()
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
>>> data['v1'] = range(7)
>>> data
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
>>> data.drop_duplicates(['k1'])
k1 k2 v1
0 one 1 0
1 two 1 1
drop_duplicates keeps the first observed value combination by default. Pass keep='last' to keep the last one instead.
>>> data.drop_duplicates(['k1', 'k2'], keep='last')
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6
Function Mapping
>>> data = pd.DataFrame({'food': ['bacon', 'pulled pork',
'bacon','Pastrami', 'corned beef',
'Bacon','pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
>>> data
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
A mapping from each meat type to the kind of animal:
>>> meat_to_animal = {
... 'bacon': 'pig',
... 'pulled pork': 'pig',
... 'pastrami': 'cow',
... 'corned beef': 'cow',
... 'honey ham': 'pig',
... 'nova lox': 'salmon'
... }
>>> lowercased = data['food'].str.lower()
>>> lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
>>> data['animal'] = lowercased.map(meat_to_animal)
>>> data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
>>> data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Replacing Values with replace
>>> data = pd.Series([1., -999., 2., -999., -1000., 3.])
>>> data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
>>> data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
>>> data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
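The argument can also be a dict mapping each value to its own replacement (a standard replace usage; the result follows from the data above):
>>> data.replace({-999: np.nan, -1000: 0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64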
Renaming Axis Indexes
>>> data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
>>> transform = lambda x: x[:4].upper()
>>> data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
>>> data.index = data.index.map(transform)
>>> data
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
>>> data.rename(index=str.title, columns=str.upper)
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
>>> data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
>>> data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
>>> data
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
Discretization
>>> ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
>>> bins = [18, 25, 35, 60, 100]
>>> cats = pd.cut(ages, bins)
>>> cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
>>> cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
>>> cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
>>> pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
>>> pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
>>> group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
>>> pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
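Passing labels=False returns only the integer bin codes, matching cats.codes above (an added illustration of a standard pd.cut option):
>>> pd.cut(ages, bins, labels=False)
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1])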
String Manipulation
String Methods
For many string munging and scripting applications, the built-in string methods are sufficient. For example, a comma-separated string can be broken into pieces with split:
>>> val = 'a,b, guido'
>>> val.split(',')
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including newlines):
>>> pieces = [x.strip() for x in val.split(',')]
>>> pieces
['a', 'b', 'guido']
Using addition, these substrings can be concatenated with a two-colon delimiter:
>>> first, second, third = pieces
>>> first + '::' + second + '::' + third
'a::b::guido'
A faster and more Pythonic way is to pass a list or tuple to the join method on the delimiter string:
>>> '::'.join(pieces)
'a::b::guido'
The best way to detect a substring is with Python's in keyword, though index and find can also be used:
>>> val
'a,b, guido'
>>> 'guido' in val
True
>>> val.index(',')
1
>>> val.find(':')
-1
Note the difference between find and index: index raises an exception when the string isn't found (rather than returning -1).
>>> val.index(':')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: substring not found
count returns the number of occurrences of a particular substring:
>>> val.count(',')
2
replace substitutes occurrences of one pattern for another; passing an empty string is a common way to delete a pattern:
>>> val.replace(',', '::')
'a::b:: guido'
>>> val.replace(',', '')
'ab guido'
Regular Expressions
Suppose we want to split a string with a variable number of whitespace characters (tabs, spaces, and newlines) as the delimiter. The regex describing one or more whitespace characters is \s+:
>>> import re
>>> text = "foo bar\t baz \tqux"
>>> re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
>>> regex = re.compile(r'\s+')
>>> regex.split(text)
['foo', 'bar', 'baz', 'qux']
To get a list of all patterns matching the regex instead, use the findall method:
>>> regex.findall(text)
[' ', '\t ', ' \t']
match and search are closely related to findall. findall returns all matches in a string, search returns only the first, and match is stricter still: it only matches at the beginning of the string.
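A quick contrast using the whitespace regex compiled above (an added illustration):
>>> regex.search(text)
<re.Match object; span=(3, 4), match=' '>
>>> print(regex.match(text))
None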
>>> text = """Dave dave@google.com
... Steve steve@gmail.com
... Rob rob@gmail.com
... Ryan ryan@yahoo.com
... """
>>> text
'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'
>>> pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
Using findall on the text produces a list of the email addresses:
>>> regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
search returns the first email address in the text as a special match object. For this regex, the match object can tell us the start and end position of the pattern in the original string:
>>> m = regex.search(text)
>>> m
<re.Match object; span=(5, 20), match='dave@google.com'>
>>> text[m.start():m.end()]
'dave@google.com'
>>> m.group()
'dave@google.com'
regex.match returns None, because it will only match if the pattern occurs at the start of the string:
>>> print(regex.match(text))
None
The regex.sub method replaces matches of the pattern with a specified string, returning the new string:
>>> print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Vectorized String Functions
Cleaning up a messy dataset for analysis often requires a lot of string regularization. To complicate matters, a column containing strings will sometimes have missing data:
>>> data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
            'Rob': 'rob@gmail.com', 'Wes': np.nan}
>>> data = pd.Series(data)
>>> data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
>>> data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool
Through data.map, all string and regular expression methods can be applied to each value (by passing a lambda or other function), but they will fail on NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values; they are accessed through the Series' str attribute.
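For instance (a hypothetical illustration, not in the original), applying a plain string test through map raises a TypeError on the NaN entry, since NaN is a float:
>>> data.map(lambda x: 'gmail' in x)
Traceback (most recent call last):
  ...
TypeError: argument of type 'float' is not iterable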
With the str attribute, we can check whether each email address contains "gmail" using str.contains:
>>> data.str.contains('gmail')
Dave False
Steve True
Rob True
Wes NaN
dtype: object
Regular expressions can be used, too:
>>> pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
>>> data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
Slicing strings:
>>> data.str[:5]
Dave dave@
Steve steve
Rob rob@g
Wes NaN
dtype: object
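Related to findall, the str.extract method (see the method table at the end of this section) returns the regex's capture groups as a DataFrame with one column per group; a sketch using the same pattern:
>>> data.str.extract(pattern, flags=re.IGNORECASE)
0 1 2
Dave dave google com
Steve steve gmail com
Rob rob gmail com
Wes NaN NaN NaN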
The str.cat method can concatenate two columns:
>>> data = [['Ali', 'Azmat', '30'],
['Sharukh', 'Khan', '40'],
['Linus', 'Torvalds', '70']]
>>> df = pd.DataFrame(data, columns=['First', 'Last', 'Age'])
>>> df
First Last Age
0 Ali Azmat 30
1 Sharukh Khan 40
2 Linus Torvalds 70
>>> df['Full Name'] = df['First'].str.cat(df['Last'], sep=' ')
>>> df
First Last Age Full Name
0 Ali Azmat 30 Ali Azmat
1 Sharukh Khan 40 Sharukh Khan
2 Linus Torvalds 70 Linus Torvalds
Method | Description |
---|---|
cat | Concatenate strings element-wise, with an optional delimiter |
contains | Return boolean array if each string contains pattern/regex |
count | Count occurrences of pattern |
extract | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
endswith | Equivalent to x.endswith(pattern) for each element |
startswith | Equivalent to x.startswith(pattern) for each element |
findall | Compute list of all occurrences of pattern/regex for each string |
get | Index into each element (retrieve i-th element) |
isalnum | Equivalent to built-in str.isalnum |
isalpha | Equivalent to built-in str.isalpha |
isdecimal | Equivalent to built-in str.isdecimal |
isdigit | Equivalent to built-in str.isdigit |
islower | Equivalent to built-in str.islower |
isnumeric | Equivalent to built-in str.isnumeric |
isupper | Equivalent to built-in str.isupper |
join | Join strings in each element of the Series with passed separator |
len | Compute length of each string |
lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element |
match | Use re.match with the passed regular expression on each element, returning matched groups as list |
pad | Add whitespace to left, right, or both sides of strings |
center | Equivalent to pad(side='both') |
repeat | Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string) |
replace | Replace occurrences of pattern/regex with some other string |
slice | Slice each string in the Series |
split | Split strings on delimiter or regular expression |
strip | Trim whitespace from both sides, including newlines |
rstrip | Trim whitespace on right side |
lstrip | Trim whitespace on left side |