Adding two columns to an existing DataFrame using withColumn

Question:

I have a DataFrame with a few columns. Now I want to add two more columns to the existing DataFrame.

Currently I am doing this using the withColumn method of DataFrame.

For example:

df.withColumn("newColumn1", udf(col("somecolumn")))
  .withColumn("newColumn2", udf(col("somecolumn")))

Actually, I could return both new column values from a single UDF as an Array[String], but currently this is how I am doing it.

Is there any way I can do this efficiently? Is using explode a good option here?

Even if I use explode, I still have to call withColumn once, return the column value as an Array[String], and then use explode to create the two new columns.

Which one is more efficient? Or are there any alternatives?

Answer:

AFAIK you need to call withColumn twice (once for each new column). But if your UDF is computationally expensive, you can avoid calling it twice by storing the "complex" result in a temporary column and then "unpacking" that result, e.g. using the apply method of Column (which gives access to the array elements). Note that sometimes it is necessary to cache the intermediate result (to prevent the UDF from being called twice per row during unpacking), and sometimes it is not. This seems to depend on how Spark optimizes the plan:

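import org.apache.spark.sql.functions.{col, udf}
// In spark-shell, sc and the toDF implicits are already in scope;
// elsewhere, also import spark.implicits._ from your SparkSession.

// UDF that returns both results at once: index 0 is upper case, index 1 is lower case
val myUDf = udf((s: String) => Array(s.toUpperCase, s.toLowerCase))

val df = sc.parallelize(Seq("Peter", "John")).toDF("name")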

val newDf = df
  .withColumn("udfResult",myUDf(col("name"))).cache 
  .withColumn("uppercaseColumn", col("udfResult")(0))
  .withColumn("lowercaseColumn", col("udfResult")(1))
  .drop("udfResult")

newDf.show()

This gives:

+-----+---------------+---------------+
| name|uppercaseColumn|lowercaseColumn|
+-----+---------------+---------------+
|Peter|          PETER|          peter|
| John|           JOHN|           john|
+-----+---------------+---------------+
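
One way to check whether the UDF runs once or twice per row on a given setup is to count its invocations. The following is only a minimal sketch and not part of the original answer: the accumulator name udfCalls, the name countingUdf, and the local SparkSession are assumptions made for illustration. Accumulators updated inside transformations can over-count when tasks are retried, so treat the number as a hint rather than an exact measurement.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimenting; in spark-shell, getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("udf-call-count").getOrCreate()
import spark.implicits._

// Hypothetical counter, incremented every time the UDF body runs.
val calls = spark.sparkContext.longAccumulator("udfCalls")

val countingUdf = udf((s: String) => {
  calls.add(1L)
  Array(s.toUpperCase, s.toLowerCase)
})

val df = Seq("Peter", "John").toDF("name")

// Same pipeline as above, deliberately without .cache.
val newDf = df
  .withColumn("udfResult", countingUdf(col("name")))
  .withColumn("uppercaseColumn", col("udfResult")(0))
  .withColumn("lowercaseColumn", col("udfResult")(1))
  .drop("udfResult")

newDf.explain()   // inspect how often the UDF appears in the physical plan
newDf.collect()   // force evaluation so the accumulator is updated
println(s"UDF invocations: ${calls.value} for ${df.count()} rows")
```

If the printed count is larger than the number of rows, the UDF was evaluated more than once per row, and adding .cache after the first withColumn (as in the answer) is worth trying.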

With a UDF returning a tuple, the unpacking would look like this:

val newDf = df
    .withColumn("udfResult",myUDf(col("name"))).cache
    .withColumn("lowercaseColumn", col("udfResult._1"))
    .withColumn("uppercaseColumn", col("udfResult._2"))
    .drop("udfResult")
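
For completeness, here is a self-contained sketch of the tuple variant. It is an illustration under stated assumptions rather than the answerer's exact code: the UDF name caseUdf and the local SparkSession are hypothetical, and it uses a single select with aliases instead of two withColumn calls followed by drop. A Scala tuple returned from a UDF is exposed by Spark as a struct with fields _1 and _2, which is why the nested column paths work.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimenting; in spark-shell, getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("tuple-udf-sketch").getOrCreate()
import spark.implicits._

// Hypothetical UDF returning (lowercase, uppercase); Spark stores it as a struct with fields _1 and _2.
val caseUdf = udf((s: String) => (s.toLowerCase, s.toUpperCase))

val df = Seq("Peter", "John").toDF("name")

val newDf = df
  .withColumn("udfResult", caseUdf(col("name")))
  .select(
    col("name"),
    col("udfResult._1").as("lowercaseColumn"), // first tuple element
    col("udfResult._2").as("uppercaseColumn")  // second tuple element
  )

newDf.show()
```

Whether the UDF is evaluated once or twice per row here is subject to the same caveat as in the array version, so caching the intermediate result may still be needed.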
