Spark和不可序列化的DateTimeFormatter

问题描述:

我试图在Spark中使用java.time.format中的DateTimeFormatter,但它似乎不可序列化.这是相关的代码块:

I'm trying to use the DateTimeFormatter from java.time.format in Spark but it appears to be not serializable. This is the relevant chunk of code:

val pattern = "<some pattern>".r
val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")

val logs = sc.wholeTextFiles(path)

val entries = logs.flatMap(fileContent => {
    val file = fileContent._1
    val content = fileContent._2
    content.split("\\r?\\n").map(line => line match {
      case pattern(dt, ev, seq) => Some(LogEntry(LocalDateTime.parse(dt, dtFormatter), ev, seq.toInt))
      case _ => logger.error(s"Cannot parse $file: $line"); None
    })
  })

如何避免java.io.NotSerializableException: java.time.format.DateTimeFormatter异常?是否有更好的库来解析时间戳?我已经读到Joda也不能序列化,并且已被合并到Java 8的时间库中.

How can I avoid the java.io.NotSerializableException: java.time.format.DateTimeFormatter exception? Is there a better library to parse timestamps? I've read that Joda is also not serializable and has been incorporated in Java 8's time library.

您可以通过两种方式避免序列化:

You can avoid serialization in two ways:

  1. 假设其值可以恒定,请将格式化程序放在object中(使其成为静态").这意味着可以在每个工作程序中访问静态值,而不是驱动程序将其序列化并发送给工作程序:

  1. Assuming its value can be constant, place the formatter in an object (making it "static"). This would mean that the static value can be accessed within each worker, instead of the driver serializing it and sending to worker:

object MyUtils {
  val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
}

import MyUtils._
logs.flatMap(fileContent => {
  // can safely use formatter here
})

  • 在匿名函数内的每个记录实例化它.这会带来一些性能损失(因为实例化将根据记录反复进行),因此仅在无法应用第一个选项的情况下才使用此选项:

  • instantiate it per record inside the anonymous function. This carries some performance penalty (as the instantiation will happen over and over, per record), so only use this option if the first can't be applied:

    logs.flatMap(fileContent => {
      val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
      // use formatter here
    })