遍历行并扩展pandas数据框

问题描述：

我有pandas数据框，其中的一列包含值或值列表(长度不等).我想扩展"行，因此列表中的每个值都变成列中的单个值.一个例子说明了一切:

I have pandas dataframe with a column containing values or lists of values (of unequal length). I want to 'expand' the rows, so each value in the list becomes single value in column. An example says it all:

dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
 u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})

    location     name
0   Amsterdam   Tom
1   [Berlin, Paris] Jim
2   [Antwerp, Barcelona, Pisa]  Claus

我想变成:

dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})

    location     name
0   Amsterdam   Tom
1   Berlin   Jim
2   Paris   Jim
3   Antwerp Claus
4   Barcelona   Claus
5   Pisa    Claus

我首先尝试使用Apply，但据我所知不可能返回多个Series.遍历似乎是诀窍.但是下面的代码给了我一个空的数据框...

I first tried using apply but it's not possible to return multiple Series as far as I know. iterrows seems to be the trick. But the code below gives me an empty dataframe...

def duplicator(series):
    if type(series['location']) == list:
        for location in series['location']:
            subSeries = series
            subSeries['location'] = location
            dfOut.append(subSeries)
    else:
        dfOut.append(series)

for index, row in dfIn.iterrows():
    duplicator(row)

答

如果返回的序列的index是位置列表，则dfIn.apply会将这些序列整理为表格:

If you return a series whose index is a list of locations, then dfIn.apply will collate those series into a table:

import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
                     u'location': ['Amsterdam', ['Berlin','Paris'],
                                   ['Antwerp','Barcelona','Pisa'] ]})

def expand(row):
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    s = pd.Series(row['name'], index=list(set(locations)))
    return s

In [156]: dfIn.apply(expand, axis=1)
Out[156]: 
  Amsterdam Antwerp Barcelona Berlin Paris   Pisa
0       Tom     NaN       NaN    NaN   NaN    NaN
1       NaN     NaN       NaN    Jim   Jim    NaN
2       NaN   Claus     Claus    NaN   NaN  Claus

然后您可以堆叠此DataFrame以获得:

You can then stack this DataFrame to obtain:

In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]: 
0  Amsterdam      Tom
1  Berlin         Jim
   Paris          Jim
2  Antwerp      Claus
   Barcelona    Claus
   Pisa         Claus
dtype: object

这是一个系列，而您需要一个DataFrame.稍微按摩一下reset_index即可获得所需的结果:

This is a Series, while you want a DataFrame. A little massaging with reset_index gives you the desired result:

dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)

收益

    location   name
0  Amsterdam    Tom
1     Berlin    Jim
2      Paris    Jim
3  Amsterdam  Claus
4    Antwerp  Claus
5  Barcelona  Claus

遍历行并扩展pandas数据框

相关推荐