A better way to convert a string field into timestamp in Spark
I have a CSV in which one field is a datetime in a specific format. I cannot import it directly into my DataFrame because it needs to be a timestamp, so I import it as a string and convert it into a Timestamp like this:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row

def getTimestamp(x: Any): Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  if (x.toString == "")
    return null
  else {
    val d = format.parse(x.toString)
    val t = new Timestamp(d.getTime)
    return t
  }
}

def convert(row: Row): Row = {
  val d1 = getTimestamp(row(3))
  return Row(row(0), row(1), row(2), d1)
}
Is there a better, more concise way to do this, with the DataFrame API or Spark SQL? The above method requires creating an RDD and then providing the schema for the DataFrame again.
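For context, the round-trip the question wants to avoid looks roughly like this. This is a sketch, not the asker's exact code: it assumes a SQLContext named sqlContext, a four-column DataFrame df, and the convert helper above; the "ts" column name is illustrative.

```scala
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Re-apply the row-level conversion by dropping down to the RDD level...
val convertedRdd = df.rdd.map(convert)

// ...and then rebuild the schema by hand, replacing the string column
// with a TimestampType column before reconstructing the DataFrame.
val newSchema = StructType(
  df.schema.fields.take(3) :+ StructField("ts", TimestampType, nullable = true))
val converted = sqlContext.createDataFrame(convertedRdd, newSchema)
```

Both steps (the map over rows and the manual schema) are what the DataFrame-level answers below eliminate.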
Spark >= 2.2
Since 2.2 you can provide the format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$@#@# |null |
// +---+-------------------+-------------------+
Spark >= 1.6
You can use the date processing functions introduced in Spark 1.5. Assuming you have the following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$@#@#")).toDF("id", "dts")
You can use unix_timestamp to parse strings and cast the result to timestamp:
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$@#@# |null |
// +---+-------------------+---------------------+
As you can see it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
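Because the pattern follows SimpleDateFormat, it can be sanity-checked outside Spark entirely. A minimal sketch using the same pattern and sample value as above:

```scala
import java.text.SimpleDateFormat

// Parse the sample value with plain SimpleDateFormat; this is the same
// pattern string that is passed to unix_timestamp above.
val fmt = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss")
val d = fmt.parse("05/26/2016 01:01:01")
println(d.getTime > 0)  // the sample date is after the Unix epoch: true
```

If SimpleDateFormat rejects your pattern here, unix_timestamp will quietly return null for every row, so this is a cheap way to debug format strings.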
Spark >= 1.5
You'll have to use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark
You should be able to use these with expr and HiveContext.
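A sketch of the expr-based variant, assuming a HiveContext-backed DataFrame df with the dts column from the example above (the "ts" output column name is illustrative):

```scala
import org.apache.spark.sql.functions.expr

// Push the whole conversion through a SQL expression string so it is
// evaluated by the SQL layer rather than the Scala DataFrame functions.
val withTs = df.withColumn(
  "ts",
  expr("CAST(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss') AS timestamp)"))
```

On versions where org.apache.spark.sql.functions.expr is not available, df.selectExpr with the same expression string is the equivalent route.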