How do I add a new column to a Spark RDD?

Problem description:
I have an RDD with many columns (e.g., hundreds). How do I add one more column at the end of this RDD?
For example, if my RDD is like below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758
How can I add a column to it whose value is the sum of the second and the third columns?
Thank you very much.

Answer
You do not have to use Tuple* objects at all to add a new column to an RDD.
It can be done by mapping each row, taking its original contents plus the elements you want to append. For example:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

val rdd: RDD[Row] = ... // your existing RDD of Rows
val withAppendedColumnsRdd = rdd.map { row =>
  val originalColumns = row.toSeq.toList
  val secondColValue = originalColumns(1).asInstanceOf[Int]
  val thirdColValue = originalColumns(2).asInstanceOf[Int]
  val newColumnValue = secondColValue + thirdColValue
  // Append the computed value as a new last column
  Row.fromSeq(originalColumns :+ newColumnValue)
  // Row.fromSeq(originalColumns ++ List(newColumnValue1, newColumnValue2, ...)) // or append several new columns at once
}
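The core of this transformation is ordinary collection manipulation, independent of Spark itself. Below is a minimal Spark-free sketch of the same pattern, where each "row" is simulated with a plain Seq[Int] instead of a spark.sql.Row (the sample values are taken from the question's data; the variable names are illustrative, not from any API):

```scala
// Each "row" is a Seq[Int]; we append a new column equal to the sum of the
// second and third columns (0-based indices 1 and 2), mirroring the RDD map.
val rows: Seq[Seq[Int]] = Seq(
  Seq(123, 523, 534, 893),
  Seq(536, 98, 1623, 98472)
)

val withSum: Seq[Seq[Int]] = rows.map { row =>
  row :+ (row(1) + row(2)) // same shape as originalColumns :+ newColumnValue
}

withSum.foreach(println)
```

Running this appends 523 + 534 = 1057 to the first row and 98 + 1623 = 1721 to the second, which is exactly what the map over the RDD does per partition, in parallel.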