PySpark: convert an array of structs to a string
I have the following DataFrame in PySpark:
+----+-------+-----+
|name|subject|score|
+----+-------+-----+
| Tom|   math|   90|
| Tom|physics|   70|
| Amy|   math|   95|
+----+-------+-----+
I used pyspark.sql.functions:
df.groupBy('name').agg(collect_list(struct('subject', 'score')).alias('score_list'))
to get the following DataFrame:
+----+--------------------+
|name|          score_list|
+----+--------------------+
| Tom|[[math, 90], [phy...|
| Amy|        [[math, 95]]|
+----+--------------------+
My question is: how can I transform the last column, score_list, into a string and dump it into a CSV file that looks like this?
Tom (math, 90) | (physics, 70)
Amy (math, 95)
Thanks for your help.
Update: Here is a similar question, but it's not exactly the same because it goes directly from string to another string. In my case, I want to first convert string to collect_list<struct> and then stringify this collect_list<struct>.
The duplicates I linked don't exactly answer your question, since you're combining multiple columns. Nevertheless, you can modify those solutions to fit your desired output quite easily.
Just replace struct with concat_ws. Also use concat to add the opening and closing parentheses to get the output you want.
from pyspark.sql.functions import collect_list, concat, concat_ws, lit

# Format each row as "(subject, score)", collect those strings per name,
# then join them with " | ".
df = df.groupBy('name')\
    .agg(
        concat_ws(
            " | ",
            collect_list(
                concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
            )
        ).alias('score_list')
    )
df.show(truncate=False)
#+----+--------------------------+
#|name|score_list                |
#+----+--------------------------+
#|Tom |(math, 90) | (physics, 70)|
#|Amy |(math, 95)                |
#+----+--------------------------+
Note that since a comma appears in the score_list column, this value will be quoted when you write to CSV with the default arguments.
For example:
df.coalesce(1).write.csv("test.csv")
writes a single part file (inside the output directory test.csv) with the following contents:
Tom,"(math, 90) | (physics, 70)"
Amy,"(math, 95)"
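The quoting follows the usual CSV convention: a field that contains the delimiter gets wrapped in quotes. As a small illustration outside Spark, here is a sketch using Python's standard csv module (not Spark's writer) on the same two rows; the helper name to_csv is made up for this example. It also shows that choosing a delimiter that does not occur in the data, such as a tab, removes the need for quotes.

```python
import csv
import io

# The two rows produced by the aggregation above.
rows = [
    ("Tom", "(math, 90) | (physics, 70)"),
    ("Amy", "(math, 95)"),
]

def to_csv(rows, delimiter):
    """Serialize rows with minimal quoting (hypothetical helper for illustration)."""
    buf = io.StringIO()
    writer = csv.writer(
        buf, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerows(rows)
    return buf.getvalue()

comma_output = to_csv(rows, ",")   # second field is quoted because it contains ","
tab_output = to_csv(rows, "\t")    # no field contains a tab, so nothing is quoted

print(comma_output)
print(tab_output)
```

In Spark, the analogous knob should be the sep option of the CSV writer, for example df.coalesce(1).write.csv("test.csv", sep="\t"), if a tab-separated file is acceptable for your use case.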