Is there a better way to read an RDD in Spark than collect?

Problem description:


So, I want to read an RDD into an array. For that purpose, I could use the collect method. But that method is really annoying because, in my case, it keeps giving Kryo buffer overflow errors. If I set the Kryo buffer size too high, it starts to cause its own problems. On the other hand, I have noticed that if I just save the RDD to a file using the saveAsTextFile method, I get no errors. So I was thinking there must be some better way of reading an RDD into an array that isn't as problematic as the collect method.
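For context, here is a minimal sketch of the two approaches described above, assuming a standard Scala Spark job; the app name, output path, and buffer value are placeholders, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectVsSave {
  def main(args: Array[String]): Unit = {
    // Raising spark.kryoserializer.buffer.max is the usual workaround for the
    // overflow, but a very large value brings its own problems, as noted above.
    val conf = new SparkConf()
      .setAppName("collect-vs-save") // placeholder app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "256m") // illustrative value
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000).map(_.toString)

    // collect() serializes every partition and ships it to the driver;
    // a large result is what triggers the Kryo buffer overflow.
    val asArray: Array[String] = rdd.collect()

    // saveAsTextFile() lets each executor write its own partition,
    // so nothing has to be gathered onto a single machine.
    rdd.saveAsTextFile("hdfs:///tmp/rdd-output") // placeholder path

    sc.stop()
  }
}
```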


No. collect is the only method for reading an RDD into an array.


saveAsTextFile never has to gather all the data onto one machine, so it is not limited by a single machine's available memory the way collect is.
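To make that concrete, here is a small spark-shell-style sketch (the partition count and output path are just illustrative, and an existing SparkContext sc is assumed) showing that saveAsTextFile writes one part file per partition rather than funnelling everything through the driver:

```scala
// Assuming an existing SparkContext `sc` (e.g. from spark-shell).
val rdd = sc.parallelize(1 to 100, numSlices = 4).map(_.toString)

// Each task writes only its own partition, producing part-00000 .. part-00003
// under the output directory. The driver just coordinates the job and never
// materializes the full dataset, which is why no buffer overflow occurs here.
rdd.saveAsTextFile("file:///tmp/rdd-parts") // placeholder path

println(s"part files expected: ${rdd.getNumPartitions}")
```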