Hadoop MapReduce - One output file for each input

Problem description:

I'm new to Hadoop and I'm trying to figure out how it works. As an exercise, I am supposed to implement something similar to the WordCount example. The task is to read in several files, run WordCount, and write one output file for each input file.

Hadoop uses a combiner and shuffles the output of the map phase as input to the reducer, then writes one output file (I guess one for each instance that is running). I was wondering whether it is possible to write one output file per input file (so keep the words of inputfile1 and write the result to outputfile1, and so on). Is it possible to override the Combiner class, or is there another solution? (I'm not sure whether this should even be solved in a Hadoop task, but this is the exercise.)

Thanks...

The map.input.file configuration parameter holds the name of the file the mapper is currently processing. Read this value in the mapper and use it as the mapper's output key; all the key/value pairs from a single file will then go to the same reducer.

Here is the code in the mapper. By the way, I am using the old MR API (org.apache.hadoop.mapred).

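// Job configuration, captured once per task so map() can read it.
private JobConf conf;

@Override
public void configure(JobConf conf) {
    this.conf = conf;
}

// The parameter list was elided in the original post; the types below
// assume the default TextInputFormat (byte offset, line of text).
@Override
public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // Path of the input file this mapper is reading.
    String filename = conf.get("map.input.file");
    // Key each line by its source file so one reducer sees the whole file.
    output.collect(new Text(filename), value);
}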

Also use MultipleOutputFormat, which allows a job to write multiple output files. The file names can be derived from the output keys and values.
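
To see how the pieces fit together without a cluster, here is a rough plain-Java simulation of the data flow. It uses no Hadoop classes at all: the class PerFileWordCount, its methods, and the ".counts" naming scheme are all made up for illustration. The "map" step tags each line with its file name, the "reduce" step counts words per file name, and the output name is derived from the key, which is roughly what keying by map.input.file plus MultipleOutputFormat would give you.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the data flow; no Hadoop classes are used.
class PerFileWordCount {

    // "Map" step: tag every line with the name of the file it came from,
    // mirroring output.collect(new Text(filename), value) in the mapper.
    static List<Map.Entry<String, String>> map(String filename, List<String> lines) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String line : lines) {
            out.add(new SimpleEntry<>(filename, line));
        }
        return out;
    }

    // "Shuffle + reduce" step: group pairs by file name, then count the
    // words of each group separately.
    static Map<String, Map<String, Integer>> reduce(List<Map.Entry<String, String>> pairs) {
        Map<String, Map<String, Integer>> countsByFile = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            Map<String, Integer> counts =
                countsByFile.computeIfAbsent(pair.getKey(), k -> new TreeMap<>());
            for (String word : pair.getValue().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return countsByFile;
    }

    // Derive the output name from the key (hypothetical naming scheme).
    static String outputNameFor(String inputFilename) {
        return inputFilename + ".counts";
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.addAll(map("inputfile1", Arrays.asList("a b a")));
        pairs.addAll(map("inputfile2", Arrays.asList("b c", "c c")));
        // One line of output per input file.
        reduce(pairs).forEach((file, counts) ->
            System.out.println(outputNameFor(file) + " -> " + counts));
    }
}
```

Running main prints one line per input file, with the word counts kept separate for inputfile1 and inputfile2. In a real job the grouping is done by the shuffle and the file naming by the output format, so only the mapper key and the naming rule need to be written by hand.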