Spark on YARN: Less executor memory than set via spark-submit

Problem description:

I'm using Spark in a YARN cluster (HDP 2.4) with the following settings:


  • 1 master node
    • 64 GB RAM (48 GB usable)
    • 12 cores (8 cores usable)
  • 5 worker nodes
    • 64 GB RAM (48 GB usable) each
    • 12 cores (8 cores usable) each
  • YARN settings
    • memory of all containers (of one host): 48 GB
    • minimum container size = maximum container size = 6 GB
    • vcores in cluster = 40 (5 x 8 cores of workers)
    • minimum #vcores/container = maximum #vcores/container = 1

When I run my Spark application with the command spark-submit --num-executors 10 --executor-cores 1 --executor-memory 5g ..., Spark should give each executor 5 GB of RAM, right? (I set the memory to only 5g to leave room for the ~10% of overhead memory.)
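Roughly, the full command looked like this (the --master and --deploy-mode flags and the application jar name below are placeholders, not necessarily the exact values used):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 10 \
      --executor-cores 1 \
      --executor-memory 5g \
      my-spark-app.jar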

But when I had a look in the Spark UI, I saw that each executor only has 3.4 GB of memory; see the screenshot:

Can someone explain why so little memory is allocated?

Answer:

The Storage Memory column in the UI displays the amount of memory used for execution and RDD storage. By default, this equals (HEAP_SPACE - 300 MB) * 75%. The rest of the memory is used for internal metadata, user data structures, and other things.
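With --executor-memory 5g, the arithmetic works out roughly as follows (a rough sketch; the exact heap size the JVM reports varies a little with the JVM version and GC settings):

    heap set via --executor-memory 5g             ≈ 5120 MB
    usable memory  = 5120 MB - 300 MB (reserved)  = 4820 MB
    storage/execution memory ≈ 4820 MB * 0.75     ≈ 3615 MB ≈ 3.5 GB

The UI shows a slightly lower value (3.4 GB) most likely because Spark bases the calculation on the heap size actually reported by the JVM, which is somewhat less than the configured -Xmx derived from --executor-memory.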

You can control this amount by setting spark.memory.fraction (not recommended). See Spark's documentation for more details.
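If you do want to change it anyway, it can be passed like any other Spark configuration property, for example (the value 0.8 and the jar name are placeholders for illustration only):

    spark-submit \
      --executor-memory 5g \
      --conf spark.memory.fraction=0.8 \
      my-spark-app.jar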