Number of reducers for one task in MapReduce

Question:

In a typical MapReduce setup (like Hadoop), how many reducers are used for one task, for example, counting words? My understanding of Google's MapReduce is that only one reducer is involved. Is that correct?

For example, word count divides the input into N chunks, and N map tasks run, each producing a list of (word, #) pairs. My question is: once the map phase is done, will only ONE reducer instance run to compute the result, or will several reducers run in parallel?

The simple answer is that the number of reducers does not have to be 1, and yes, reducers can run in parallel. The number is either user-defined or derived.
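To make the parallelism concrete, here is a minimal single-process sketch of word count with several reducers. It is an illustration, not Hadoop code: the function names are made up for this example, and `zlib.crc32` is used as a deterministic stand-in for the key hash (Hadoop's default partitioner hashes the key and takes it modulo the number of reduce tasks). The point is that every occurrence of a word lands in the same bucket, so each bucket can be reduced independently of the others.

```python
import zlib
from collections import defaultdict


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]


def partition(word, num_reducers):
    """Pick the reducer for a key (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def reduce_partition(pairs):
    """Reduce step: sum the counts of every word routed to one reducer."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(chunks, num_reducers):
    # Shuffle: route every mapper's output into one bucket per reducer.
    buckets = [[] for _ in range(num_reducers)]
    for chunk in chunks:
        for word, count in map_chunk(chunk):
            buckets[partition(word, num_reducers)].append((word, count))
    # The buckets share no keys, so a real cluster could reduce them in parallel.
    return [reduce_partition(bucket) for bucket in buckets]


chunks = ["a b a", "b c", "a c c"]
outputs = word_count(chunks, num_reducers=2)

# Each reducer writes its own output; merging is only done here for display.
merged = {}
for part in outputs:
    merged.update(part)
```

With two reducers there are two separate outputs, and each word appears in exactly one of them. Setting `num_reducers=1` collapses this to the single-reducer case the question describes.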

To keep things concrete, I will refer to Hadoop so you have an idea of how things work. If you are using the streaming API in Hadoop (0.20.2), you have to define explicitly how many reducers you would like to run, since by default only 1 reduce task is launched. You do so by passing the -D mapred.reduce.tasks=<number of reducers> argument. The Java API will try to derive the number of reducers you need, but again you can set it explicitly. In both cases, there is a hard cap on the number of reducers that can run at once per node, set in your mapred-site.xml configuration file using mapred.tasktracker.reduce.tasks.maximum.
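As a rough sketch of what such a streaming invocation looks like, the helper below only assembles the argument list and does not run anything. The jar name, paths, and script names are placeholders assumed for illustration, and `streaming_command` is a hypothetical helper, not part of Hadoop. The one detail worth noting is that the generic `-D` option goes before the streaming-specific options.

```python
def streaming_command(num_reducers, streaming_jar, input_path, output_path,
                      mapper, reducer):
    """Build the argument list for a Hadoop streaming job (hypothetical helper)."""
    if num_reducers < 0:
        raise ValueError("num_reducers must be zero or positive")
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -D must precede the streaming options.
        "-D", f"mapred.reduce.tasks={num_reducers}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# All file names below are placeholders, not real paths.
cmd = streaming_command(4, "hadoop-streaming.jar", "in", "out",
                        "mapper.py", "reducer.py")
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine where the `hadoop` command is available.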

On a more conceptual note, you can look at this post on the Hadoop wiki that talks about choosing the number of map and reduce tasks.
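For a feel of the kind of sizing guidance involved: the Hadoop documentation of that era suggests a reduce count of roughly 0.95 or 1.75 times the cluster's total reduce slots, where total slots is the number of nodes times mapred.tasktracker.reduce.tasks.maximum. The sketch below just does that arithmetic; the function name and the example cluster size are assumptions for illustration, and the result is a starting point rather than a rule.

```python
def suggested_reduce_tasks(num_nodes, reduce_slots_per_node):
    """Return (one-wave, two-wave) reduce task counts from the 0.95 / 1.75 heuristic."""
    total_slots = num_nodes * reduce_slots_per_node
    # Integer arithmetic avoids floating-point rounding surprises.
    one_wave = (95 * total_slots) // 100
    two_waves = (175 * total_slots) // 100
    return one_wave, two_waves


# Assumed example cluster: 10 nodes with 2 reduce slots each.
one_wave, two_waves = suggested_reduce_tasks(10, 2)
```

The lower figure lets all reduces start as soon as the maps finish; the higher one lets faster nodes pick up a second round of reduces, which tends to balance load better.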