Is there a better way to read an RDD in Spark than collect?

Problem description:


So, I want to read an RDD into an array. For that purpose, I could use the collect method. But that method is really annoying because, in my case, it keeps giving Kryo buffer overflow errors. If I set the Kryo buffer size too high, it starts to cause its own problems. On the other hand, I have noticed that if I just save the RDD to a file using the saveAsTextFile method, I get no errors. So I was thinking there must be some better way of reading an RDD into an array that isn't as problematic as the collect method.
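For context, here is a minimal sketch of the two approaches described above, assuming a standard Scala Spark job; the app name, output path, and buffer value are placeholders, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectVsSave {
  def main(args: Array[String]): Unit = {
    // Raising spark.kryoserializer.buffer.max is the usual workaround for the
    // overflow, but a very large value brings its own problems, as noted above.
    val conf = new SparkConf()
      .setAppName("collect-vs-save") // placeholder app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "256m") // illustrative value
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000).map(_.toString)

    // collect() serializes every partition and ships it to the driver;
    // a large result is what triggers the Kryo buffer overflow.
    val asArray: Array[String] = rdd.collect()

    // saveAsTextFile() lets each executor write its own partition,
    // so nothing has to be gathered onto a single machine.
    rdd.saveAsTextFile("hdfs:///tmp/rdd-output") // placeholder path

    sc.stop()
  }
}
```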


No. collect is the only method for reading an RDD into an array.


saveAsTextFile never has to gather all the data onto one machine, so it is not limited by a single machine's available memory the way collect is.
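To make that concrete, here is a small spark-shell-style sketch (the partition count and output path are just illustrative, and an existing SparkContext sc is assumed) showing that saveAsTextFile writes one part file per partition rather than funnelling everything through the driver:

```scala
// Assuming an existing SparkContext `sc` (e.g. from spark-shell).
val rdd = sc.parallelize(1 to 100, numSlices = 4).map(_.toString)

// Each task writes only its own partition, producing part-00000 .. part-00003
// under the output directory. The driver just coordinates the job and never
// materializes the full dataset, which is why no buffer overflow occurs here.
rdd.saveAsTextFile("file:///tmp/rdd-parts") // placeholder path

println(s"part files expected: ${rdd.getNumPartitions}")
```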