Unable to read streaming data from a single file in Spark Streaming

Problem description:

I am trying to read streaming data from a text file that is continuously appended to, using the Spark Streaming API "textFileStream", but I am unable to read the continuously appended data with Spark Streaming. How can I achieve this in Spark?
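
For reference, a typical textFileStream setup looks roughly like the following minimal Scala sketch; the directory path, batch interval, and app name are placeholders, not taken from the question:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal DStream job that watches a directory for *new* files.
// Note: textFileStream only picks up files that appear in the directory,
// not appends to files that are already there.
val conf = new SparkConf().setAppName("TextFileStreamExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.textFileStream("/data/input")  // hypothetical monitored directory
lines.print()                                  // print a few lines of each batch

ssc.start()
ssc.awaitTermination()
```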

This is expected behavior. For file-based sources (like fileStream):

  • The files must be created in the dataDirectory by atomically moving or renaming them into the data directory (a sketch of such a move follows this list).
  • Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
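
A minimal sketch of the "atomic move" pattern from the first bullet, assuming the staging file and the monitored directory live on the same filesystem (both paths are hypothetical):

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

// 1. Write the complete file somewhere *outside* the monitored directory.
val staging = Paths.get("/tmp/staging/batch-0001.txt")
Files.createDirectories(staging.getParent)
Files.write(staging, "record 1\nrecord 2\n".getBytes("UTF-8"))

// 2. Atomically move it into dataDirectory; Spark sees it as one new, immutable file.
//    ATOMIC_MOVE requires source and target to be on the same filesystem.
val target = Paths.get("/data/input/batch-0001.txt")
Files.move(staging, target, StandardCopyOption.ATOMIC_MOVE)
```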

If you want to read continuously appended data, you'll have to create your own source, or use a separate process that monitors the changes and pushes the records to, for example, Kafka (though it is rare to combine Spark with file systems that support appends). A sketch of that Kafka-based approach follows below.
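
As one possible shape of that workaround, a separate process (e.g. a file tailer) could publish each appended line to a Kafka topic, and the Spark job would consume the topic instead of the file. This is a rough sketch using the spark-streaming-kafka-0-10 integration; the broker address, topic name, and group id are assumptions for illustration only:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("KafkaTailExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Kafka consumer configuration (values are placeholders).
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "file-tail-consumer",
  "auto.offset.reset"  -> "latest"
)

// "appended-lines" is a hypothetical topic that the external tailer process writes to.
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("appended-lines"), kafkaParams)
)

// Each record's value is one appended line from the original file.
kafkaStream.map(_.value()).print()

ssc.start()
ssc.awaitTermination()
```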