Can anyone explain RDD blocks in executors?
Can anyone explain why RDD blocks increase when I run my Spark code a second time, even though they were stored in Spark memory during the first run? I am supplying input using a thread. What is the exact meaning of RDD blocks?
I have been researching this today, and it seems the "RDD Blocks" figure is actually the sum of RDD blocks and non-RDD blocks. Check out the code at: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala
val rddBlocks = status.numBlocks
And if you go to the following link in the Apache Spark repo on GitHub: https://github.com/apache/spark/blob/d5b1d5fc80153571c308130833d0c0774de62c92/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala
you will find the following lines of code:
/**
* Return the number of blocks stored in this block manager in O(RDDs) time.
*
* @note This is much faster than `this.blocks.size`, which is O(blocks) time.
*/
def numBlocks: Int = _nonRddBlocks.size + numRddBlocks
Non-RDD blocks are the ones created by broadcast variables, as those are stored as cached blocks in memory. The driver sends tasks to the executors through broadcast variables. These system-created broadcast variables are later deleted by the ContextCleaner service, and the corresponding non-RDD blocks are removed along with them. RDD blocks, on the other hand, are released by calling rdd.unpersist().
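To see this in action, here is a minimal, self-contained sketch (the application name and local master are my own placeholders, not from the question). It caches an RDD, inspects the cached-block count via `SparkContext.getRDDStorageInfo` (which reflects what the Executors page reports as RDD blocks), and then unpersists it:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBlocksDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical app name/master for a local test run
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-blocks-demo").setMaster("local[2]"))

    // Persist an RDD with 4 partitions; each cached partition becomes one RDD block
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
      .persist(StorageLevel.MEMORY_ONLY)
    rdd.count() // an action is needed to actually materialize and cache the blocks

    // getRDDStorageInfo mirrors the RDD-block portion of the Executors page
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions} cached blocks")
    }

    // Drops the RDD blocks; non-RDD (broadcast) blocks are cleaned up
    // separately by the ContextCleaner service
    rdd.unpersist(blocking = true)
    println(s"RDDs still cached: ${sc.getRDDStorageInfo.length}")

    sc.stop()
  }
}
```

If you rerun the same code without unpersisting, each run caches a fresh set of blocks, which is consistent with the block count growing across runs.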