Expanding a pandas DataFrame column
I have a pandas DataFrame that looks something like this:
import pandas as pd

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]
df = pd.DataFrame(dict(text=text, labels=labels))
df
   text                              labels
0  abcd                    label_1, label_2
1  efgh           label_1, label_3, label_2
2  ijkl                    label_2, label_4
3  mnop           label_1, label_2, label_5
4  qrst                    label_2, label_3
5  uvwx  label_3, label_5, label_1, label_2
6    yz                    label_1, label_3
I would like to reshape the DataFrame into something like this:
text  label_1  label_2  label_3  label_4  label_5
abcd      1.0      1.0      0.0      0.0      0.0
efgh      1.0      1.0      1.0      0.0      0.0
ijkl      0.0      1.0      0.0      1.0      0.0
mnop      1.0      1.0      0.0      0.0      1.0
qrst      0.0      1.0      1.0      0.0      0.0
uvwx      1.0      1.0      1.0      0.0      1.0
  yz      1.0      0.0      1.0      0.0      0.0
How can I accomplish this?
(I know I can split the strings in the labels column into lists with something like df.labels.str.split(","), but I'm not sure how to proceed from there.)
(Basically, I'd like to turn each keyword in the labels column into its own column and fill in 1 wherever it appears, as shown in the expected output.)
You can use pd.Series.str.get_dummies on the labels column (after stripping the spaces that follow each comma) and join the result with the text series:
dummies = df['labels'].str.replace(' ', '').str.get_dummies(',')
res = df['text'].to_frame().join(dummies)
print(res)
   text  label_1  label_2  label_3  label_4  label_5
0  abcd        1        1        0        0        0
1  efgh        1        1        1        0        0
2  ijkl        0        1        0        1        0
3  mnop        1        1        0        0        1
4  qrst        0        1        1        0        0
5  uvwx        1        1        1        0        1
6    yz        1        0        1        0        0
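The result above holds integer indicators, while the expected output in the question shows 1.0/0.0. A minimal self-contained sketch of one way to get floats, assuming every label is separated by exactly a comma followed by one space (so ", " can be passed as the separator instead of stripping spaces first); the variable names are only illustrative:

```python
import pandas as pd

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
labels = [
    "label_1, label_2",
    "label_1, label_3, label_2",
    "label_2, label_4",
    "label_1, label_2, label_5",
    "label_2, label_3",
    "label_3, label_5, label_1, label_2",
    "label_1, label_3",
]
df = pd.DataFrame(dict(text=text, labels=labels))

# Assumption: labels are always delimited by ", " (comma plus one space).
# Each distinct label becomes its own indicator column.
dummies = df["labels"].str.get_dummies(sep=", ")

# Cast the integer indicators to float so they display as 1.0 / 0.0,
# then attach them to the text column by index.
res = df[["text"]].join(dummies.astype(float))
print(res)
```

If the spacing around the commas is inconsistent, stripping the spaces first, as in the answer above, is probably the safer route.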