Number of partitions in an RDD and performance in Spark

Problem description:

In PySpark, I can create an RDD from a list and decide how many partitions to have:

from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize(range(0, 10), 4)  # split the data into 4 partitions

How does the number of partitions I split my RDD into influence performance? And how does this depend on the number of cores my machine has?
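One way to make the relationship concrete is to inspect both numbers directly. A minimal sketch, assuming the sc and rdd from the snippet above; rdd.getNumPartitions() and sc.defaultParallelism are the standard PySpark accessors:

# Number of partitions this RDD was split into (4 in the example above)
print(rdd.getNumPartitions())

# Default parallelism Spark infers, typically tied to the cores available
print(sc.defaultParallelism)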

The primary effect comes from specifying either too few partitions or far too many partitions.

Too few partitions: you will not utilize all of the cores available in the cluster.
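As an illustration (not part of the original answer): if an existing RDD has too few partitions to keep every core busy, repartition can spread it across more; the target of 16 below is just an assumed value.

# Hypothetical example: spread the data over more partitions so every core gets tasks
wider = rdd.repartition(16)        # triggers a shuffle to redistribute the data
print(wider.getNumPartitions())    # 16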

Too many partitions: there will be excessive overhead in managing many small tasks.
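Conversely, an over-partitioned RDD can be merged into fewer, larger partitions with coalesce, which by default avoids a full shuffle; the numbers below are purely illustrative.

# Hypothetical example: collapse an over-partitioned RDD into fewer, larger partitions
many = sc.parallelize(range(100000), 10000)   # 10,000 tiny partitions
fewer = many.coalesce(100)                    # merge down without a full shuffle
print(fewer.getNumPartitions())               # 100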

Between the two, the first is far more impactful on performance. For partition counts below roughly 1000, scheduling too many small tasks has a relatively minor impact. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
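As a rough starting point, the common Spark tuning guidance of about 2 to 4 partitions per CPU core can be expressed in terms of the parallelism the context reports; the factor of 3 below is an assumption to tune for your data and cluster.

# Rule-of-thumb sketch: a few partitions per available core
# (the factor of 3 is an assumed value, not a hard rule)
target = sc.defaultParallelism * 3
big_rdd = sc.parallelize(range(1000000), target)
print(big_rdd.getNumPartitions())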