数据合并重塑
层次化索引
层次化索引(hierarchical indexing)能以低维度形式处理高维度数据。
Series
data = pd.Series(np.random.randn(9),
index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
>>> data
a 1 0.995134
2 -0.011890
3 0.295728
b 1 0.084780
3 -1.246741
c 1 0.249314
2 -1.139397
d 2 1.016694
3 0.294815
dtype: float64
>>> data.index
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
子集选取:
>>> data['b']
1 0.084780
3 -1.246741
dtype: float64
>>> data['b':'c']
b 1 0.084780
3 -1.246741
c 1 0.249314
2 -1.139397
dtype: float64
>>> data.loc[['b', 'd']]
b 1 0.084780
3 -1.246741
d 2 1.016694
3 0.294815
dtype: float64
内侧索引选取:
>>> data.loc[:, 2]
a -0.011890
c -1.139397
d 1.016694
dtype: float64
层次化索引在数据重塑和基于分组的操作(如透视表生成)中扮演着重要的角色。例如,可以通过 unstack 方法将这段数据重新安排到一个 DataFrame 中:
>>> data.unstack()
1 2 3
a 0.995134 -0.011890 0.295728
b 0.084780 NaN -1.246741
c 0.249314 -1.139397 NaN
d NaN 1.016694 0.294815
stack()
>>> data.unstack().stack()
a 1 0.995134
2 -0.011890
3 0.295728
b 1 0.084780
3 -1.246741
c 1 0.249314
2 -1.139397
d 2 1.016694
3 0.294815
dtype: float64
DataFrame
>>> frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'],
[1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
>>> frame
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
可以对多层索引设置名称:
>>> frame.index.names = ['key1', 'key2']
>>> frame.columns.names = ['state', 'color']
>>> frame
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
选取分组:
>>> frame['Ohio']
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
层次化索引可以单独创建:
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']],
names=['state', 'color'])
排序
swaplevel 可以互换索引级别。(数据不会改动)
>>> frame.swaplevel('key1', 'key2')
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
而 sort_index 则根据单个级别中的值对数据进行排序。
>>> frame.sort_index(level=1)
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
交换级别时,常常也会用到 sort_index,这样最终结果就是按照指定顺序进行字母排序了:
>>> frame.swaplevel(0, 1).sort_index(level=0)
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
统计
通过 level 选项指定统计级别:
>>> frame
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
>>> frame.sum(level='key2')
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
>>> frame.sum(level='color', axis=1)
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
使用DataFrame的列进行索引
人们经常想要将 DataFrame 的一个或多个列当做行索引来用,或者可能希望将行索引变成 DataFrame 的列。
>>> frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two',
'two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]})
>>> frame
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
DataFrame 的 set_index 函数会将其一个或多个列转换为行索引,并创建一个新的DataFrame:
>>> frame2 = frame.set_index(['c', 'd'])
>>> frame2
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
默认情况下,转换为索引的列会从 DataFrame 中移除,但也可以将其保留下来:
>>> frame.set_index(['c', 'd'], drop=False)
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
reset_index 的功能跟 set_index 刚好相反,层次化索引的级别会被转移到列里面:
>>> frame2.reset_index()
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
合并数据
merge
多对一
类似于数据库的 join。
>>> df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
>>> df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
>>> df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
>>> df2
key data2
0 a 0
1 b 1
2 d 2
默认情况下,merge就会将重叠列的列名当做键:
>>> pd.merge(df1, df2)
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
>>> pd.merge(df1, df2, on='key')
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
默认情况下,merge 使用内链接 inner,结果是键的交集。
>>> pd.merge(df1, df2, how='outer')
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 6.0 1.0
3 a 2.0 0.0
4 a 4.0 0.0
5 a 5.0 0.0
6 c 3.0 NaN
7 d NaN 2.0
>>> df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
>>> df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
'data2': range(3)})
>>> df3
lkey data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
>>> df4
rkey data2
0 a 0
1 b 1
2 d 2
>>> pd.merge(df3, df4, left_on='lkey', right_on='rkey')
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
多对多
>>> df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
>>> df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
'data2': range(5)})
>>> df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
>>> df2
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
>>> pd.merge(df1, df2, on='key', how='left')
key data1 data2
0 b 0 1.0
1 b 0 3.0
2 b 1 1.0
3 b 1 3.0
4 a 2 0.0
5 a 2 2.0
6 c 3 NaN
7 a 4 0.0
8 a 4 2.0
9 b 5 1.0
10 b 5 3.0
多对多连接产生的是行的 笛卡尔积。由于左边的 DataFrame 有 3 个 b 行,右边的有 2 个,所以最终结果中就有 6 个 b 行。
>>> pd.merge(df1, df2, how='inner')
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2
多个键
>>> left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
'key2': ['one', 'two', 'one'],
'lval': [1, 2, 3]})
>>> right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
>>> left
key1 key2 lval
0 foo one 1
1 foo two 2
2 bar one 3
>>> right
key1 key2 rval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7
>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0
重复列名
merge 有一个更实用的 suffixes 选项,用于指定附加到左右两个 DataFrame 对象的重叠列名上的字符串:
>>> pd.merge(left, right, on='key1')
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
>>> pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
基于索引合并
有时候,DataFrame 中的连接键位于其索引中。在这种情况下,你可以传入left_index=True 或 right_index=True(或两个都传)以说明索引应该被用作连接键:
>>> left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
'value': range(6)})
>>> right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
>>> left1
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
>>> right1
group_val
a 3.5
b 7.0
>>> pd.merge(left1, right1, left_on='key', right_index=True)
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
>>> pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
同时使用合并双方的索引也没问题:
>>> left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
>>> right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
index=['b', 'c', 'd', 'e'],
columns=['Missouri', 'Alabama'])
>>> left2
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
>>> right2
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
>>> pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
基于轴合并数据
>>> s1 = pd.Series([0, 1], index=['a', 'b'])
>>> s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
>>> s3 = pd.Series([5, 6], index=['f', 'g'])
>>> s1
a 0
b 1
dtype: int64
>>> s2
c 2
d 3
e 4
dtype: int64
>>> s3
f 5
g 6
dtype: int64
>>> pd.concat([s1, s2, s3])
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
默认情况下,concat 是在 axis=0 上工作的,最终产生一个新的 Series。如果传入 axis=1,则结果就会变成一个 DataFrame( axis=1 是列):
>>> pd.concat([s1, s2, s3], axis=1)
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
传入 join='inner' 即可得到它们的交集:
>>> s4
a 0
b 1
f 5
g 6
dtype: int64
>>> pd.concat([s1, s4], axis=1)
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
>>> pd.concat([s1, s4], axis=1, join='inner')
0 1
a 0 0
b 1 1
假设你想要在连接轴上创建一个层次化索引,使用 keys 参数即可达到这个目的:
>>> result = pd.concat([s1, s1, s3], keys=['one','two', 'three'])
>>> result
one a 0
b 1
two a 0
b 1
three f 5
g 6
dtype: int64
如果沿着 axis=1 对 Series 进行合并,则 keys 就会成为 DataFrame 的列头:
>>> pd.concat([s1, s2, s3], axis=1, keys=['one','two', 'three'])
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
同样的逻辑也适用于 DataFrame 对象:
>>> df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
index=['a', 'b', 'c'],columns=['one', 'two'])
>>> df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
index=['a','c'],columns=['three', 'four'])
>>> df1
one two
a 0 1
b 2 3
c 4 5
>>> df2
three four
a 5 6
c 7 8
>>> pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
上面的 DataFrame 连接也可以通过字典的方式:
>>> pd.concat({'level1': df1, 'level2': df2}, axis=1)
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
DataFrame 连接时忽略索引:
>>> df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
>>> df1
a b c d
0 0.799077 -1.133295 -1.042869 -0.018429
1 1.424598 -1.170022 1.052247 0.035564
2 1.925944 -0.202279 -1.461569 -0.422749
>>> df2
b d a
0 1.032928 -2.137308 1.197095
1 -0.204432 2.012739 1.442443
>>> pd.concat([df1, df2], ignore_index=True)
a b c d
0 0.799077 -1.133295 -1.042869 -0.018429
1 1.424598 -1.170022 1.052247 0.035564
2 1.925944 -0.202279 -1.461569 -0.422749
3 1.197095 1.032928 NaN -2.137308
4 1.442443 -0.204432 NaN 2.012739
>>> pd.concat([df1, df2])
a b c d
0 0.799077 -1.133295 -1.042869 -0.018429
1 1.424598 -1.170022 1.052247 0.035564
2 1.925944 -0.202279 -1.461569 -0.422749
0 1.197095 1.032928 NaN -2.137308
1 1.442443 -0.204432 NaN 2.012739
合并覆盖数据
combine_first 用传递对象中的数据为调用对象的缺失数据 “打补丁”:
>>> df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],'b': [np.nan, 2., np.nan, 6.],'c': range(2, 18, 4)})
>>> df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],'b': [np.nan, 3., 4., 6., 8.]})
>>> df1
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
>>> df2
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
>>> df1.combine_first(df2)
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
数据重塑
重塑层次化索引
- stack :将数据的列“旋转”为行
- unstack :将数据的行“旋转”为列
>>> data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio','Colorado'],name='state'),
columns=pd.Index(['one', 'two', 'three'],name='number'))
>>> data
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
>>> result = data.stack()
>>> result
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int64
>>> result.unstack()
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
默认情况下,unstack 操作的是最内层( stack 也是如此)。传入分层级别的编号或名称即可对其它级别进行 unstack 操作:
>>> result.unstack(0)
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
>>> result.unstack('state')
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
如果不是所有的级别值都能在各分组中找到的话,则 unstack 操作可能会引入缺失数据:
>>> s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
>>> s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
>>> data2 = pd.concat([s1, s2], keys=['one', 'two'])
>>> data2
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
>>> data2.unstack()
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
stack 默认会滤除缺失数据,因此该运算是可逆的:
>>> data2.unstack()
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
>>> data2.unstack().stack()
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
>>> data2.unstack().stack(dropna=False)
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
在对 DataFrame 进行 unstack 操作时,作为旋转轴的级别将会成为结果中的最低级别:
>>> df = pd.DataFrame({'left': result, 'right': result + 5},
columns=pd.Index(['left', 'right'],name='side'))
>>> df
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
>>> df.unstack('state')
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
当调用 stack ,我们可以指明轴的名字:
>>> df.unstack('state').stack('side')
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7
数据扁平化
一个常见的时间序列数据如下:
>>> ldata[:10]
date item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
5 1959-06-30 unemp 5.100
6 1959-09-30 realgdp 2775.488
7 1959-09-30 infl 2.740
8 1959-09-30 unemp 5.300
9 1959-12-31 realgdp 2785.204
date 列的日期值包含重复数据,有时候,我们希望以唯一的时间序列来查看数据,则可以使用
pivot 方法进行转换:
>>> pivoted = ldata.pivot('date', 'item', 'value')
>>> pivoted
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
1960-06-30 0.14 2834.390 5.2
1960-09-30 2.70 2839.022 5.6
1960-12-31 1.21 2802.616 6.3
1961-03-31 -0.40 2819.264 6.8
1961-06-30 1.47 2872.005 7.0
... ... ... ...
2007-06-30 2.75 13203.977 4.5
2007-09-30 3.45 13321.109 4.7
2007-12-31 6.38 13391.249 4.8
2008-03-31 2.82 13366.865 4.9
2008-06-30 8.53 13415.266 5.4
2008-09-30 -3.16 13324.600 6.0
2008-12-31 -8.79 13141.920 6.9
2009-03-31 0.94 12925.410 8.1
2009-06-30 3.37 12901.504 9.2
2009-09-30 3.56 12990.341 9.6
[203 rows x 3 columns]
前两个传递的参数值分别用作行和列索引(data、item),最后一个可选值则是用于填充DataFrame 的数据列。
如果填充 DataFrame 的数据列有两列的话:
>>> ldata['value2'] = np.random.randn(len(ldata))
>>> ldata[:10]
date item value value2
0 1959-03-31 realgdp 2710.349 0.523772
1 1959-03-31 infl 0.000 0.000940
2 1959-03-31 unemp 5.800 1.343810
3 1959-06-30 realgdp 2778.801 -0.713544
4 1959-06-30 infl 2.340 -0.831154
5 1959-06-30 unemp 5.100 -2.370232
6 1959-09-30 realgdp 2775.488 -1.860761
7 1959-09-30 infl 2.740 -0.860757
8 1959-09-30 unemp 5.300 0.560145
9 1959-12-31 realgdp 2785.204 -1.265934
pivot 方法忽略最后一个参数:
>>> pivoted = ldata.pivot('date', 'item')
>>> pivoted[:5]
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 0.000940 0.523772 1.343810
1959-06-30 2.34 2778.801 5.1 -0.831154 -0.713544 -2.370232
1959-09-30 2.74 2775.488 5.3 -0.860757 -1.860761 0.560145
1959-12-31 0.27 2785.204 5.6 0.119827 -1.265934 -1.063512
1960-03-31 2.31 2847.699 5.2 -2.359419 0.332883 -0.199543
>>> pivoted['value'][:5]
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2
pivot 其实就是用 set_index 创建层次化索引,再用 unstack 重塑:
>>> unstacked = ldata.set_index(['date', 'item']).unstack('item')
>>> unstacked[:7]
value value2
item infl realgdp unemp infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8 0.000940 0.523772 1.343810
1959-06-30 2.34 2778.801 5.1 -0.831154 -0.713544 -2.370232
1959-09-30 2.74 2775.488 5.3 -0.860757 -1.860761 0.560145
1959-12-31 0.27 2785.204 5.6 0.119827 -1.265934 -1.063512
1960-03-31 2.31 2847.699 5.2 -2.359419 0.332883 -0.199543
1960-06-30 0.14 2834.390 5.2 -0.970736 -1.541996 -1.307030
1960-09-30 2.70 2839.022 5.6 0.377984 0.286350 -0.753887
pivot 逆运算
>>> df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
'A': [1, 2, 3],'B': [4, 5, 6],'C': [7, 8, 9]})
>>> df
key A B C
0 foo 1 4 7
1 bar 2 5 8
2 baz 3 6 9
指定 key 为分组标签:
>>> melted = pd.melt(df, ['key'])
>>> melted
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
6 foo C 7
7 bar C 8
8 baz C 9
pivot 重塑:
>>> reshaped = melted.pivot('key', 'variable', 'value')
>>> reshaped
variable A B C
key
bar 2 5 8
baz 3 6 9
foo 1 4 7
因为 pivot 的结果从列创建了一个索引,用作行标签,我们可以使用 reset_index 将数据移回列:
>>> reshaped.reset_index()
variable key A B C
0 bar 2 5 8
1 baz 3 6 9
2 foo 1 4 7
你还可以指定列的子集,作为值的列:
>>> pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
pandas.melt 也可以不用分组标签:
>>> pd.melt(df, value_vars=['A', 'B', 'C'])
variable value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
>>> pd.melt(df, value_vars=['key', 'A', 'B'])
variable value
0 key foo
1 key bar
2 key baz
3 A 1
4 A 2
5 A 3
6 B 4
7 B 5
8 B 6