Spark: Add column to dataframe conditionally
I am trying to take my input data:
A B C
--------------
4 blah 2
2 3
56 foo 3
And add a column to the end based on whether B is empty or not:
A B C D
--------------------
4 blah 2 1
2 3 0
56 foo 3 1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
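For context, the temp-table + SQL approach mentioned above can be sketched as follows (a minimal sketch assuming Spark 1.x with an existing SparkContext sc; the table name "input" is illustrative):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF`

val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3)))
  .toDF("A", "B", "C")

// Register the dataframe as a temp table, then express the
// conditional column with a SQL CASE expression.
df.registerTempTable("input")
val withD = sqlContext.sql(
  """SELECT A, B, C,
     CASE WHEN B IS NULL OR B = '' THEN 0 ELSE 1 END AS D
     FROM input""")
withD.show()
```

The answer below achieves the same result without writing out any SQL.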
I've tried .withColumn, but I can't get that to do what I want.
Try withColumn with the function when, as follows:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show()
which gives:
+---+----+---+---+
| A| B| C| D|
+---+----+---+---+
| 4|blah| 2| 1|
| 2| | 3| 0|
| 56| foo| 3| 1|
|100|null| 5| 0|
+---+----+---+---+
I added the (100, null, 5) row for testing the isNull case.
I tried this code with Spark 1.6.0, but as commented in the code of when, it works on versions after 1.4.0.