How to replace numbers with null in a DataFrame?

Problem description:

It might be strange, but I was wondering how to replace any number in a whole DataFrame Column with null, using Scala.

Imagine I have a nullable DoubleType column named col. There, I want to replace every number outside the range (1.0 ~ 10.0) with null.

I tried the following code, without success.

val xf = df.na.replace("col", Map(0.0 -> null.asInstanceOf[Double]).toMap)

But, as you may know, in Scala converting a null to a Double yields 0.0, which is not what I want. Besides, I can't see a way to do this for a range of values. Is there any way to achieve this?
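The cast behaviour described above can be checked in a plain Scala REPL, with no Spark involved (a minimal sketch):

```scala
// Casting null to the primitive Double type in Scala unboxes it
// to the type's default value, 0.0 -- not a real null.
val d: Double = null.asInstanceOf[Double]
println(d)  // prints 0.0
```

This is why the Map(0.0 -> null.asInstanceOf[Double]) attempt cannot work: the null collapses back into 0.0 before Spark ever sees it.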

How about using a when clause instead?

import org.apache.spark.sql.functions.when

// Assumes spark-shell, where the implicits needed for toDF and $ are in scope
val df = sc.parallelize(
  (1L, 0.0) :: (2L, 3.6) :: (3L, 12.0) :: (4L, 5.0) :: Nil
).toDF("id", "val")

// Values outside [1.0, 10.0] fall through to NULL (no otherwise branch)
df.withColumn("val", when($"val".between(1.0, 10.0), $"val")).show

// +---+----+
// | id| val|
// +---+----+
// |  1|null|
// |  2| 3.6|
// |  3|null|
// |  4| 5.0|
// +---+----+

Any value which doesn't satisfy the predicate (here val BETWEEN 1.0 AND 10.0) will be replaced with NULL.
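When when is used without an otherwise branch, non-matching rows default to NULL, so the line above is shorthand for the more explicit form below (a sketch, reusing the df defined above; requires a running Spark session):

```scala
import org.apache.spark.sql.functions.{when, lit}

// Equivalent explicit form: the otherwise branch spells out the NULL,
// cast to double so the column type stays DoubleType.
df.withColumn(
  "val",
  when($"val".between(1.0, 10.0), $"val")
    .otherwise(lit(null).cast("double"))
).show
```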

See also: Create new DataFrame with empty/null field values