Scala: how to split a dataframe into multiple CSV files by row count
Problem description:
I have a dataframe, say df1, with 10M rows. I want to split it into multiple CSV files with 1M rows each. Any suggestions for doing this in Scala?
Answer:
You can use the randomSplit method on DataFrames.
import scala.util.Random
import spark.implicits._  // assumes an existing SparkSession named `spark` (pre-imported in spark-shell)

val df = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9).toDF
// randomSplit takes an array of Double weights; equal weights give roughly equal-sized parts
val splitted = df.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0))
splitted.foreach { part => part.write.format("csv").save("path" + Random.nextInt()) }
I used Random.nextInt to get a unique name. You can add other logic there if necessary.
Source:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
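If you want deterministic file names instead of random ones, a minimal sketch is to zip the parts with their index (this assumes the `splitted` array from above; the base directory "path" is a placeholder):

// a sketch: deterministic, indexed output paths instead of Random.nextInt
splitted.zipWithIndex.foreach { case (part, i) =>
  part.write.format("csv").save(s"path/part_$i")
}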
Alternatively, you can carve off fixed-size chunks by repeatedly taking `limit` rows and subtracting them from the remaining input:

import org.apache.spark.sql.{Dataset, Row}

var input = List(1, 2, 3, 4, 5, 6, 7, 8, 9).toDF
val limit = 2
var newFrames = List[Dataset[Row]]()
var size = input.count()
while (size > 0) {
  // take the next chunk, then remove those rows from the remaining input
  newFrames = input.limit(limit) :: newFrames
  // note: except() is a set difference, so duplicate rows would be lost
  input = input.except(newFrames.head)
  size = size - limit
}
newFrames.foreach(_.show)
The first element in the resulting list may contain fewer rows than the rest, since it holds the last (possibly partial) chunk.
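For the original question (10M rows, ~1M rows per file), a hedged sketch: compute the number of parts from the row count and let randomSplit produce roughly equal pieces. Note that randomSplit is approximate, so each file will hold about 1M rows rather than exactly 1M; `df1` and the output path "output" are assumptions taken from the question:

// a sketch under the question's assumptions: df1 exists and ~1M rows per file is acceptable
val rowsPerFile = 1000000L
val numParts = math.ceil(df1.count().toDouble / rowsPerFile).toInt
val parts = df1.randomSplit(Array.fill(numParts)(1.0))
parts.zipWithIndex.foreach { case (part, i) =>
  // coalesce(1) writes each part as a single CSV file inside its own directory
  part.coalesce(1).write.format("csv").save(s"output/part_$i")
}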