遍历行并扩展pandas数据框
我有pandas数据框,其中的一列包含值或值列表(长度不等).我想扩展"行,因此列表中的每个值都变成列中的单个值.一个例子说明了一切:
I have pandas dataframe with a column containing values or lists of values (of unequal length). I want to 'expand' the rows, so each value in the list becomes single value in column. An example says it all:
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})
location name
0 Amsterdam Tom
1 [Berlin, Paris] Jim
2 [Antwerp, Barcelona, Pisa] Claus
我想变成:
dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Antwerp Claus
4 Barcelona Claus
5 Pisa Claus
我首先尝试使用Apply,但据我所知不可能返回多个Series.遍历似乎是诀窍.但是下面的代码给了我一个空的数据框...
I first tried using apply but it's not possible to return multiple Series as far as I know. iterrows seems to be the trick. But the code below gives me an empty dataframe...
def duplicator(series):
if type(series['location']) == list:
for location in series['location']:
subSeries = series
subSeries['location'] = location
dfOut.append(subSeries)
else:
dfOut.append(series)
for index, row in dfIn.iterrows():
duplicator(row)
如果返回的序列的index
是位置列表,则dfIn.apply
会将这些序列整理为表格:
If you return a series whose index
is a list of locations, then dfIn.apply
will collate those series into a table:
import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'],
['Antwerp','Barcelona','Pisa'] ]})
def expand(row):
locations = row['location'] if isinstance(row['location'], list) else [row['location']]
s = pd.Series(row['name'], index=list(set(locations)))
return s
In [156]: dfIn.apply(expand, axis=1)
Out[156]:
Amsterdam Antwerp Barcelona Berlin Paris Pisa
0 Tom NaN NaN NaN NaN NaN
1 NaN NaN NaN Jim Jim NaN
2 NaN Claus Claus NaN NaN Claus
然后您可以堆叠此DataFrame以获得:
You can then stack this DataFrame to obtain:
In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]:
0 Amsterdam Tom
1 Berlin Jim
Paris Jim
2 Antwerp Claus
Barcelona Claus
Pisa Claus
dtype: object
这是一个系列,而您需要一个DataFrame.稍微按摩一下reset_index
即可获得所需的结果:
This is a Series, while you want a DataFrame. A little massaging with reset_index
gives you the desired result:
dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)
收益
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Amsterdam Claus
4 Antwerp Claus
5 Barcelona Claus