PySpark: convert a DataFrame to RDD[String]

Question:

I'd like to convert a pyspark.sql.dataframe.DataFrame to a pyspark.rdd.RDD[String].

I converted a DataFrame df to an RDD data:

data = df.rdd
type(data)
## pyspark.rdd.RDD 

The new RDD data contains Row objects:

first = data.first()
type(first)
## pyspark.sql.types.Row

data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')

I'd like to convert each Row to a list of String, like the example below:

u'aaa',u'bbb',u'ccc',u'ddd'

Thanks.

A PySpark Row is a subclass of tuple and can be used as one. All you need here is a simple map with list (or flatMap, if you also want to flatten the rows into a single sequence of values):

data.map(list)

Or, if the columns may hold other types and you want every value as a string:

data.map(lambda row: [str(c) for c in row])