Convert a list of standard Python key-value dictionaries to a PySpark DataFrame
Problem:
Consider I have a list of Python dictionary key-value pairs, where each key corresponds to a column name of a table. For the list below, how do I convert it into a PySpark DataFrame with two columns, arg1 and arg2?
[{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]
How can I use the following construct to do it?
df = sc.parallelize([
...
]).toDF()
Where do arg1 and arg2 go in the code above (the ...)?
Answer:
Old way (note: inferring the schema from a dict triggers a deprecation warning in Spark 2.x and may stop working in later versions):
sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]).toDF()
New way:
from pyspark.sql import Row
from collections import OrderedDict
def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row has the same field order,
    # giving deterministic column names when toDF() infers the schema.
    return Row(**OrderedDict(sorted(d.items())))
sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]) \
.map(convert_to_row) \
.toDF()