Converting a string to a date in a Spark Dataframe
I have a dataframe (df1) with 2 StringType fields.
Field1 (StringType), value: X
Field2 (StringType), value: 20180101
All I am trying to do is create another dataframe (df2) from df1 with 2 fields:
Field1 (StringType), value: X
Field2 (DateType), value: 2018-01-01
I am using the below code:
val df2 = df1.select(
  col("field1").alias("f1"),
  unix_timestamp(col("field2"), "yyyyMMdd").alias("f2")
)
df2.show
df2.printSchema
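For reference, unix_timestamp returns the parsed value as seconds since the epoch (a LongType), so the printed schema presumably looks something like this rather than containing a date:

df2.printSchema
// root
//  |-- f1: string (nullable = true)
//  |-- f2: long (nullable = true)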
For this field 2, I tried multiple things - unix_timestamp, from_unixtime, to_date, cast("date") - but nothing worked.
I need the following schema as output:
df2.printSchema
|-- f1: string (nullable = false)
|-- f2: date (nullable = false)
I'm using Spark 2.1.
to_date seems to work fine for what you need:
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and $; already in scope in spark-shell

val df1 = Seq(("X", "20180101"), ("Y", "20180406")).toDF("c1", "c2")
val df2 = df1.withColumn("c2", to_date($"c2", "yyyyMMdd")) // format argument requires Spark 2.2+
df2.show
// +---+----------+
// | c1| c2|
// +---+----------+
// | X|2018-01-01|
// | Y|2018-04-06|
// +---+----------+
df2.printSchema
// root
// |-- c1: string (nullable = true)
// |-- c2: date (nullable = true)
[UPDATE]
For Spark 2.1 or prior, to_date doesn't take a format string as a parameter, so the string first needs to be explicitly reformatted into the standard yyyy-MM-dd format using, say, regexp_replace:
// rewrite "20180101" as "2018-01-01" so to_date can parse the default yyyy-MM-dd format
val df2 = df1.withColumn(
  "c2", to_date(regexp_replace($"c2", "(\\d{4})(\\d{2})(\\d{2})", "$1-$2-$3"))
)
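An equivalent route that also works on Spark 2.1 (a sketch, assuming the same df1 as above) is to parse with unix_timestamp and cast the epoch seconds down to a date:

// alternative for Spark 2.1: parse to epoch seconds, then cast through timestamp to date
val df2 = df1.withColumn(
  "c2", unix_timestamp($"c2", "yyyyMMdd").cast("timestamp").cast("date")
)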