Hadoop - Large files in the distributed cache

Problem description:

I have a 4 GB file that I am trying to share across all mappers through the distributed cache, but I am observing a significant delay before the map task attempts start. Specifically, there is a long gap between the time I submit my job (through job.waitForCompletion()) and the time the first map task starts.

I would like to know what the side effects of keeping large files in the DistributedCache are. How many times is a file in the distributed cache replicated? Does the number of nodes in the cluster have any effect on this?

(My cluster has about 13 nodes running on very powerful machines, each able to host close to 10 map slots.)

Thanks

缓存"在这种情况下有点误导.您的 4 GB 文件将与 jars 和配置一起分发到每个任务.

"Cache" in this case is a bit misleading. Your 4 GB file will be distributed to every task along with the jars and configuration.

For files larger than roughly 200 MB, I usually put them directly into the filesystem and set the replication to a higher value than the usual default (in your case I would set it to 5-7). You can read directly from the distributed filesystem in every task through the usual FileSystem calls, like:

// "config" is the job Configuration, e.g. context.getConfiguration() in a task.
FileSystem fs = FileSystem.get(config);
FSDataInputStream in = fs.open(new Path("/path/to/the/larger/file"));
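
If you go this route, you can raise the replication factor of that one file either with "hadoop fs -setrep" or programmatically. A minimal sketch of the latter, assuming the same path as above and a placeholder replication factor:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: raise the replication factor of the shared file so that
// more nodes hold a local copy. Path and factor are placeholders.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.setReplication(new Path("/path/to/the/larger/file"), (short) 7);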

This saves space in the cluster and should also not delay task start. However, in the case of non-local HDFS reads, the data has to be streamed to the task, which can consume a considerable amount of bandwidth.
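
A common pattern with this approach is to read the file once per task in the mapper's setup() rather than once per record. A rough sketch, assuming the new MapReduce API; the class name and the in-memory loading step are made up for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Rough sketch: load the shared HDFS file once per task in setup().
public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path shared = new Path("/path/to/the/larger/file");
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(shared)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // ... build whatever in-memory structure the job needs ...
            }
        } finally {
            reader.close();
        }
    }
}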