Number of partitions in an RDD and performance in Spark

Problem description:

In PySpark, I can create an RDD from a list and decide how many partitions it should have:

from pyspark import SparkContext

sc = SparkContext()
sc.parallelize(range(0, 10), 4)  # distribute the 10 elements across 4 partitions
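
Continuing from the snippet above, a quick way to confirm how many partitions the RDD actually received is getNumPartitions():

rdd = sc.parallelize(range(0, 10), 4)
rdd.getNumPartitions()  # -> 4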

How does the number of partitions I choose for my RDD influence performance? And how does this depend on the number of cores my machine has?
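
As a point of reference, Spark exposes a default parallelism that is typically derived from the cores available to the application, and parallelize falls back to it when no partition count is given. A minimal sketch, reusing the sc created above:

# defaultParallelism usually reflects the cores the application was given
# (for example, in local mode local[*] uses all local cores)
print(sc.defaultParallelism)

# With no explicit numSlices, parallelize uses defaultParallelism
print(sc.parallelize(range(100)).getNumPartitions())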

The primary effect comes from specifying either too few partitions or far too many partitions.

Too few partitions: you will not utilize all of the cores available in the cluster.

Too many partitions: there will be excessive overhead in managing many small tasks.

Between the two, the first one is far more impactful on performance. For partition counts below 1000, scheduling too many small tasks has a relatively small impact. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
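
A rough way to see this scheduling overhead for yourself is to run the same trivial job with increasing partition counts; this is only a sketch and the absolute timings will vary with your hardware and cluster setup:

import time

for n in (8, 1000, 50000):
    start = time.time()
    # The work per element is trivial, so at high partition counts most of
    # the elapsed time is task-scheduling overhead rather than computation
    sc.parallelize(range(100000), n).count()
    print(n, "partitions:", round(time.time() - start, 2), "s")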