Split cell into multiple rows in pandas dataframe
I have a dataframe that contains order data. Each order has multiple packages stored as comma-separated strings in the package and package_code columns.
I want to split the package data and create one row per package, with the order details repeated on each row.
Here is a sample input dataframe:
import pandas as pd
df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1,p2,p3","p4","p5,p6"],"package_code":["#111,#222,#333","#444","#555,#666"]})
And this is what I am trying to achieve as output:
How can I do that with pandas?
Here's one way using numpy.repeat and itertools.chain. Conceptually, this is exactly what you want to do: repeat some values, chain others. It is recommended for a small number of columns; otherwise, stack-based methods may fare better.
import numpy as np
from itertools import chain
# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))
# calculate lengths of splits
lens = df['package'].str.split(',').map(len)
# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'order_id': np.repeat(df['order_id'], lens),
                    'order_date': np.repeat(df['order_date'], lens),
                    'package': chainer(df['package']),
                    'package_code': chainer(df['package_code'])})
print(res)
   order_id order_date package package_code
0         1  20/5/2018      p1         #111
0         1  20/5/2018      p2         #222
0         1  20/5/2018      p3         #333
1         3  22/5/2018      p4         #444
2         7  23/5/2018      p5         #555
2         7  23/5/2018      p6         #666
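As an alternative that avoids listing every column by hand, here is a minimal sketch using DataFrame.explode. This is a different technique from the repeat/chain approach above, not a variant of it. It assumes pandas 1.3 or later (where explode accepts a list of columns) and that package and package_code contain the same number of comma-separated items in every row. The name `cols` is introduced here only for illustration.

```python
import pandas as pd

# Sample input from the question
df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1,p2,p3", "p4", "p5,p6"],
    "package_code": ["#111,#222,#333", "#444", "#555,#666"],
})

# Columns holding comma-separated strings (illustrative name)
cols = ["package", "package_code"]

# Turn each comma-separated string into a list
split = df.assign(**{c: df[c].str.split(",") for c in cols})

# Explode both list columns together; assumes pandas >= 1.3.
# Raises ValueError if the lists in a row differ in length.
res = split.explode(cols, ignore_index=True)

print(res)
```

Unlike the output above, ignore_index=True gives the result a fresh 0 to 5 index; drop that argument to keep the repeated original index instead.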