MapReduce Design Patterns (6. Job Chaining) (Part 11): Chapter 6. Meta Patterns

http://blog.****.net/cuirong1986/article/details/8492804

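Meta patterns do not solve a particular problem; they deal with the relationships between patterns, so they can be thought of as "patterns about patterns." The first one discussed is job chaining, which combines several patterns to solve complex, multistage problems. The second is job merging, an optimization that performs several analytics in the same MapReduce job, killing several birds with one stone.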

Job chaining

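Understanding job chaining and planning how to operate job chains is very important. Many people find that a problem cannot be solved with a single MapReduce job: a series of jobs has to run, and some of them need the output of others. Once you become comfortable solving problems with a series of MapReduce jobs, you face a whole new set of challenges.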

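Job chaining is hard to handle because it is not a built-in feature of the MapReduce framework. Systems like Hadoop are designed to make running a single MapReduce job easy, but running a multistage job takes a lot of manual work. You have to consider, among other things, jobs that fail at some stage and the intermediate output that must be cleaned up. This section discusses several approaches to job chaining. Some will suit your needs better than others, and each has pros and cons.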

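Several frameworks and tools have emerged to fill this niche. If your workflows are numerous and complex, you should consider using one of them. The approaches described here are more lightweight and have to be implemented job by job. Oozie, an open source Apache project, can build workflows and coordinate job execution. Building job chains is only one of its features, and it is very useful for operationalizing Hadoop MapReduce jobs.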

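A common pitfall of using MapReduce is applying it to data that is too small to be worth distributing. If you think chaining two jobs is the right choice, consider how much output the first job produces. If there is a lot of output, by all means use a second MapReduce job. Quite often, though, the output of a job is small enough to be processed efficiently on a single node. There are two ways to do this: load the data through the file system API in the driver after the job completes, or wrap the steps in some kind of script.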

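Notice: A major concern with MapReduce chains is the size of the temporary files. Sometimes they are tiny, which can lead to a large number of map tasks in the next job. In a non-chained job, the number of reducers usually depends on the amount of data the reducers receive, not on the amount of output. When chaining, the size of the output files matters, even if it means each reducer runs longer. Aim for output files about the size of one block in the distributed file system. Experiment with different numbers of reducers and see where the performance bottleneck is.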

Another option is to use CombineFileInputFormat to load the fragmented output. It packs small files into larger input splits before they are handed to the mappers.
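The block-size guidance in the note above can be turned into a rough starting point for the reducer count. The sketch below is only a rule of thumb, not a Hadoop API: the class name, the method name, and the 128 MB block size are assumptions made for illustration, and the expected output size has to be estimated by you.

```java
class ReducerCountEstimate {
    /**
     * Suggests a reducer count so that each reducer writes roughly one block.
     * Hypothetical helper: a heuristic starting point for experiments only.
     */
    static int suggestReducers(long expectedOutputBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        if (expectedOutputBytes <= 0) {
            return 1; // always run at least one reducer
        }
        // Ceiling division: a partial block still needs its own reducer
        long reducers = (expectedOutputBytes + blockSizeBytes - 1) / blockSizeBytes;
        return (int) Math.min(Integer.MAX_VALUE, reducers);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;            // assumed block size: 128 MB
        long expectedOutput = 10L * 1024 * 1024 * 1024; // assumed output size: 10 GB
        System.out.println(suggestReducers(expectedOutput, blockSize)); // prints 80
    }
}
```

The value is only where to begin; the note's advice to try different reducer counts and measure still applies.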

With the Driver

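Probably the simplest way to run a job chain is a master driver that simply fires off the individual job drivers one after another. There is nothing special about a driver: it is plain Java and is not tied to any particular class.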

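Calling the job drivers in sequence makes the jobs run in the required order. You must make sure that the output path of the first job is the input path of the second, which you can do by sharing a temporary directory variable.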

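In a production environment the temporary directory should be cleaned up; it should not exist once the chain has finished. Without disciplined cleanup, your cluster will fill up quickly. Also be mindful of how much temporary data you create, because it has to be stored in the file system.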

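This approach extends easily to chains longer than two jobs. Keep track of the temporary directories and, where appropriate, delete data that later jobs no longer need.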
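The shape of such a chain can be sketched without a cluster. The example below is a stand-in, not Hadoop code: two ordinary methods play the role of the two jobs, the local file system plays the role of HDFS, and all class, method, and file names are hypothetical. It shows the two points made above: the intermediate path is shared between the stages, and it is deleted in a finally block whether the chain succeeds or fails.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ChainWithTempDir {
    // Stage one (stand-in for the first job): trims each line and writes a part file
    static boolean stageOne(List<String> input, Path intermediate) throws IOException {
        Files.createDirectories(intermediate);
        List<String> cleaned = input.stream().map(String::trim).collect(Collectors.toList());
        Files.write(intermediate.resolve("part-00000"), cleaned, StandardCharsets.UTF_8);
        return true;
    }

    // Stage two: reads every part file of stage one and writes the non-empty line count
    static boolean stageTwo(Path intermediate, Path finalOutput) throws IOException {
        long count = 0;
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(intermediate)) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    if (!line.isEmpty()) {
                        count++;
                    }
                }
            }
        }
        Files.write(finalOutput, Long.toString(count).getBytes(StandardCharsets.UTF_8));
        return true;
    }

    // The output path of stage one is the input path of stage two
    static boolean runChain(List<String> input, Path intermediate, Path finalOutput)
            throws IOException {
        try {
            // Stage two only runs if stage one reported success
            return stageOne(input, intermediate) && stageTwo(intermediate, finalOutput);
        } finally {
            // Clean up the intermediate data on success and on failure
            deleteRecursively(intermediate);
        }
    }

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Delete children before their parent directories
            List<Path> deepestFirst = walk.sorted(Comparator.reverseOrder())
                    .collect(Collectors.toList());
            for (Path p : deepestFirst) {
                Files.delete(p);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("chain_demo");
        Path intermediate = work.resolve("out_int");
        Path finalOutput = work.resolve("out.txt");
        boolean ok = runChain(Arrays.asList(" a ", "", "b"), intermediate, finalOutput);
        System.out.println("success=" + ok
                + ", intermediate exists=" + Files.exists(intermediate));
    }
}
```

In a real driver the same try/finally structure would wrap the calls that run the jobs, with the delete going through the Hadoop FileSystem API instead of java.nio.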

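You can run jobs in parallel by using Job.submit() instead of Job.waitForCompletion(). The submit method returns immediately and the job runs in the background, which lets you run several jobs at once. Use Job.isComplete(), a nonblocking check, to poll periodically for whether a job has finished.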

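Another thing to watch is whether a job succeeded. Knowing that a job has completed is not enough; you also need to check whether it succeeded. If a job that others depend on fails, stop the whole chain instead of letting it continue.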

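Clearly, from a software engineering standpoint, managing and maintaining this process becomes very difficult as job chains get more complex. That is why tools such as JobControl and Oozie exist.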

Job Chaining Examples

Basic job chaining

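The goal of this example is to output each user together with two pieces of information: reputation and number of posts. That could be done in a single MapReduce job, but we also want to split users into two groups based on the average number of posts. We need one job to compute the counts and another to bin users based on the average. Four patterns are used here: numerical summarization, counting, binning, and a replicated join.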

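The average number of posts per user is computed with the framework's counters. In the second job, the user data set is placed in the DistributedCache so that the output can be enriched with each user's reputation. This enrichment sets up the next example, which computes the average reputation of the users in each of the two bins (above or below average).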

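Problem: Given a data set of StackOverflow posts, bin users according to whether they are above or below the average number of posts per user. Also enrich each user with his or her reputation from a separate data set when generating the output.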

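Job one mapper. Before looking at the driver, let's understand the mappers and reducers of both jobs. The mapper reads the user ID from the OwnerUserId attribute of each record and emits it as the output key with a value of 1. It also increments a record counter by one; the driver later uses this counter to compute the average number of posts per user. AVERAGE_CALC_GROUP is a public static String defined at the driver level.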

public static class UserIdCountMapper extends
       Mapper<Object, Text, Text, LongWritable> {
    public static final String RECORDS_COUNTER_NAME = "Records";
    private static final LongWritable ONE = new LongWritable(1);
    private Text outkey = new Text();
 
    public void map(Object key, Text value, Context context)
           throws IOException, InterruptedException {
       Map<String, String> parsed = MRDPUtils.transformXmlToMap(value
              .toString());
       String userId = parsed.get("OwnerUserId");
       if (userId != null) {
           outkey.set(userId);
           context.write(outkey, ONE);
           context.getCounter(AVERAGE_CALC_GROUP, RECORDS_COUNTER_NAME)
                  .increment(1);
       }
    }
}

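Job one reducer. The reducer is also fairly simple. It iterates over the input values, sums them, and emits the sum as the value with the input key as the key. A separate counter is incremented once per reduce group so that the average can be computed later.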

public static class UserIdSumReducer extends
       Reducer<Text, LongWritable, Text, LongWritable> {
    public static final String USERS_COUNTER_NAME = "Users";
    private LongWritable outvalue = new LongWritable();
 
    public void reduce(Text key, Iterable<LongWritable> values,
           Context context) throws IOException, InterruptedException {
       // Increment user counter, as each reduce group represents one user
       context.getCounter(AVERAGE_CALC_GROUP, USERS_COUNTER_NAME)
              .increment(1);
       // Use a long so that the sum of LongWritable values cannot overflow an int
       long sum = 0;
       for (LongWritable value : values) {
           sum += value.get();
       }
       outvalue.set(sum);
       context.write(key, outvalue);
    }
}

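Job two mapper. This mapper is more complex than the previous one and does several things to produce the desired output. The setup phase does three things. The average number of posts per user is read from the Context object, where it was placed during job configuration. MultipleOutputs is initialized so that output can be written to the different bins. Finally, the user data set is parsed from the DistributedCache to build a map from user ID to reputation, which is used for the enrichment.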

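Compared with setup, the map method is fairly easy. It parses the input value to get the user ID and the number of posts by splitting the value on tabs and taking the first two fields. The output key is set to the user ID, and the output value is the number of posts and the user's reputation, separated by a tab. The number of posts is then compared with the average to bin the user.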

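The optional fourth argument of MultipleOutputs.write names the output file. A constant specifies the directory for users whose post count is below or above the average. The extra "/part" string appended to the directory name becomes the start of the file name, and the framework then appends -m-nnnnn, where nnnnn is the task ID. With this naming, a directory is created for each of the two bins, and each directory contains the part files. This makes input and output management easier in the next example, which runs jobs in parallel.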

Finally, the cleanup phase closes MultipleOutputs.

public static class UserIdBinningMapper extends
       Mapper<Object, Text, Text, Text> {
    public static final String AVERAGE_POSTS_PER_USER = "avg.posts.per.user";
 
    public static void setAveragePostsPerUser(Job job, double avg) {
       job.getConfiguration().set(AVERAGE_POSTS_PER_USER,
              Double.toString(avg));
    }
 
    public static double getAveragePostsPerUser(Configuration conf) {
       return Double.parseDouble(conf.get(AVERAGE_POSTS_PER_USER));
    }
 
    private double average = 0.0;
    private MultipleOutputs<Text, Text> mos = null;
    private Text outkey = new Text(), outvalue = new Text();
    private HashMap<String, String> userIdToReputation = new HashMap<String, String>();
 
    protected void setup(Context context) throws IOException,
           InterruptedException {
       average = getAveragePostsPerUser(context.getConfiguration());
       mos = new MultipleOutputs<Text, Text>(context);
       Path[] files = DistributedCache.getLocalCacheFiles(context
              .getConfiguration());
       // Read all files in the DistributedCache
       for (Path p : files) {
           BufferedReader rdr = new BufferedReader(new InputStreamReader(
                  new GZIPInputStream(new FileInputStream(new File(
                         p.toString())))));
           String line;
           // For each record in the user file
           while ((line = rdr.readLine()) != null) {
              // Get the user ID and reputation
              Map<String, String> parsed = MRDPUtils
                     .transformXmlToMap(line);
              // Map the user ID to the reputation
              userIdToReputation.put(parsed.get("Id"),
                     parsed.get("Reputation"));
           }
       }
    }
 
    public void map(Object key, Text value, Context context)
           throws IOException, InterruptedException {
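       // Input is the tab-separated output of job one: user ID, post count
       String[] tokens = value.toString().split("\t");
       String userId = tokens[0];
       int posts = Integer.parseInt(tokens[1]);
       outkey.set(userId);
       // Emit post count and reputation, separated by a tab
       outvalue.set((long) posts + "\t" + userIdToReputation.get(userId));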
       if ((double) posts < average) {
           mos.write(MULTIPLE_OUTPUTS_BELOW_NAME, outkey, outvalue,
                  MULTIPLE_OUTPUTS_BELOW_NAME + "/part");
       } else {
           mos.write(MULTIPLE_OUTPUTS_ABOVE_NAME, outkey, outvalue,
                  MULTIPLE_OUTPUTS_ABOVE_NAME + "/part");
       }
    }
 
    protected void cleanup(Context context) throws IOException,
           InterruptedException {
       mos.close();
    }
}
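The naming scheme used by the mapper above can be illustrated in isolation. The sketch below does not call Hadoop; it only rebuilds the file name that results when a base output path such as "belowavg/part" is passed as the fourth argument and the framework appends the map-task suffix. The class and method names are hypothetical, and the bin names "belowavg" and "aboveavg" are assumed to match the directories used by the Bash example later in this section.

```java
import java.util.Locale;

class NamedOutputPaths {
    // Picks the bin the same way the mapper does: strictly below average or not
    static String binFor(long posts, double average, String belowName, String aboveName) {
        return posts < average ? belowName : aboveName;
    }

    // Rebuilds the name of a map-side output file: <base>-m-<five-digit task id>
    static String mapOutputFile(String baseOutputPath, int taskId) {
        return String.format(Locale.ROOT, "%s-m-%05d", baseOutputPath, taskId);
    }

    public static void main(String[] args) {
        String bin = binFor(3, 4.5, "belowavg", "aboveavg");
        // The "/part" suffix puts the file inside a per-bin directory
        System.out.println(mapOutputFile(bin + "/part", 0)); // prints belowavg/part-m-00000
    }
}
```

Because each bin ends up in its own directory, a later job can take that directory as its input path without any filtering.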

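Driver code. Now for the most complicated part, the driver. It is discussed in two pieces: the first job and the second job. For the first job, the driver parses the command-line arguments and creates the input and output paths. The intermediate directory it creates is deleted by the driver at the end of the chain.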

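Notice: A string is appended to the output directory name to form the intermediate output directory. That is fine in most cases, but a naming convention for intermediate directories that avoids conflicts is better. If the output directory already exists when a job is submitted, the job will not start.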

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path postInput = new Path(args[0]);
    Path userInput = new Path(args[1]);
    Path outputDirIntermediate = new Path(args[2] + "_int");
    Path outputDir = new Path(args[2]);
    // Setup first job to count user posts
    Job countingJob = new Job(conf, "JobChaining-Counting");
    countingJob.setJarByClass(JobChainingDriver.class);
    // Set our mapper and reducer, we can use the API's long sum reducer for
    // a combiner!
    countingJob.setMapperClass(UserIdCountMapper.class);
    countingJob.setCombinerClass(LongSumReducer.class);
    countingJob.setReducerClass(UserIdSumReducer.class);
    countingJob.setOutputKeyClass(Text.class);
    countingJob.setOutputValueClass(LongWritable.class);
    countingJob.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(countingJob, postInput);
    countingJob.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(countingJob, outputDirIntermediate);
    // Execute job and grab exit code
    int code = countingJob.waitForCompletion(true) ? 0 : 1;
    ...

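Before the second job runs, the driver checks whether the first job succeeded. This looks simple enough, but for more complex chains such checks become tedious. Before configuring the second job, the driver reads the counter values from the first job, computes the average number of posts per user, and adds it to the job configuration. It then sets the mapper and disables the reduce phase. The other key parts to note are the configuration of MultipleOutputs and the DistributedCache. The job is then executed.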

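Finally, and most importantly, the intermediate output directory is deleted whether the chain succeeded or failed. This is an important and often overlooked step. Leftover intermediate directories fill up a cluster quickly and force you to delete them by hand. If you do not need the data, delete it.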

    if (code == 0) {
        // Calculate the average posts per user by getting counter values
        double numRecords = (double) countingJob
                .getCounters()
                .findCounter(AVERAGE_CALC_GROUP,
                        UserIdCountMapper.RECORDS_COUNTER_NAME).getValue();
        double numUsers = (double) countingJob
                .getCounters()
                .findCounter(AVERAGE_CALC_GROUP,
                        UserIdSumReducer.USERS_COUNTER_NAME).getValue();
        double averagePostsPerUser = numRecords / numUsers;
        // Setup binning job
        Job binningJob = new Job(new Configuration(), "JobChaining-Binning");
        binningJob.setJarByClass(JobChainingDriver.class);
        // Set mapper and the average posts per user
        binningJob.setMapperClass(UserIdBinningMapper.class);
        UserIdBinningMapper.setAveragePostsPerUser(binningJob,
                averagePostsPerUser);
        // Map-only job: disable the reduce phase
        binningJob.setNumReduceTasks(0);
        binningJob.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(binningJob, outputDirIntermediate);
        // Add two named outputs for below/above average
        MultipleOutputs.addNamedOutput(binningJob,
                MULTIPLE_OUTPUTS_BELOW_NAME, TextOutputFormat.class,
                Text.class, Text.class);
        MultipleOutputs.addNamedOutput(binningJob,
                MULTIPLE_OUTPUTS_ABOVE_NAME, TextOutputFormat.class,
                Text.class, Text.class);
        MultipleOutputs.setCountersEnabled(binningJob, true);
        TextOutputFormat.setOutputPath(binningJob, outputDir);
        // Add the user files to the DistributedCache
        FileStatus[] userFiles = FileSystem.get(conf).listStatus(userInput);
        for (FileStatus status : userFiles) {
            DistributedCache.addCacheFile(status.getPath().toUri(),
                    binningJob.getConfiguration());
        }
        // Execute job and grab exit code
        code = binningJob.waitForCompletion(true) ? 0 : 1;
    }
    // Clean up the intermediate output, whether the chain succeeded or failed
    FileSystem.get(conf).delete(outputDirIntermediate, true);
    System.exit(code);
}
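The handoff between the two jobs comes down to a division of two counter values and a round trip through a string-valued configuration. The sketch below isolates that step under stated assumptions: java.util.Properties stands in for the Hadoop Configuration, the class and method names are hypothetical, and the guard against a zero user count is an addition that the driver above does not have.

```java
import java.util.Properties;

class AverageFromCounters {
    // Same key the binning mapper uses for the average
    static final String AVERAGE_POSTS_PER_USER = "avg.posts.per.user";

    // numRecords and numUsers correspond to the "Records" and "Users" counters
    static double averagePostsPerUser(long numRecords, long numUsers) {
        if (numUsers == 0) {
            // Without this guard the division would yield NaN or Infinity
            throw new IllegalStateException("no users counted; cannot compute an average");
        }
        return (double) numRecords / numUsers;
    }

    // Driver side: store the average as a string
    static void setAverage(Properties conf, double avg) {
        conf.setProperty(AVERAGE_POSTS_PER_USER, Double.toString(avg));
    }

    // Mapper side: parse it back in setup
    static double getAverage(Properties conf) {
        return Double.parseDouble(conf.getProperty(AVERAGE_POSTS_PER_USER));
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // stand-in for the job configuration
        setAverage(conf, averagePostsPerUser(10, 4));
        System.out.println(getAverage(conf)); // prints 2.5
    }
}
```

Failing fast on an empty first job is one possible design choice; another would be to skip the binning job entirely when no users were counted.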

Parallel job chaining

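The driver for parallel job chaining is similar to the previous example. The only big change is that the jobs are submitted in parallel and then monitored until they complete. The two jobs in this example are independent of each other (although both consume the output of the previous example). The added benefit is better use of cluster resources, because both jobs run at the same time.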

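Problem: Given the binned user data produced by the previous example, run jobs over both bins at the same time to compute the average reputation of each bin.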

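Mapper code. The mapper splits the input value into a string array. The third element of the array (index 2) is the user's reputation. It is emitted with a single shared key. Because all reputations must be grouped together to compute the average, every map task uses the same key. NullWritable would work, but we want a key with a meaningful description.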

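Notice: This can be an expensive operation for very large data sets, because a single reducer has to pull all the intermediate key/value pairs over the network. The advantage over reading the data serially on one node is that the input splits are read in parallel and the reducer fetches the map output with a configurable number of threads.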

public static class AverageReputationMapper extends
       Mapper<LongWritable, Text, Text, DoubleWritable> {
    private static final Text GROUP_ALL_KEY = new Text(
           "Average Reputation:");
    private DoubleWritable outvalue = new DoubleWritable();
 
    protected void map(LongWritable key, Text value, Context context)
           throws IOException, InterruptedException {
       // Split the line into tokens
       String[] tokens = value.toString().split("\t");
       // Get the reputation from the third column
       double reputation = Double.parseDouble(tokens[2]);
       // Set the output value and write to context
       outvalue.set(reputation);
       context.write(GROUP_ALL_KEY, outvalue);
    }
}

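Reducer code. The reducer simply iterates over the reputation values, keeping a running sum and a count of users, then divides to get the average and emits it with the input key.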

public static class AverageReputationReducer extends
       Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private DoubleWritable outvalue = new DoubleWritable();
 
    protected void reduce(Text key, Iterable<DoubleWritable> values,
           Context context) throws IOException, InterruptedException {
       double sum = 0.0;
       double count = 0;
       for (DoubleWritable dw : values) {
           sum += dw.get();
           ++count;
       }
       outvalue.set(sum / count);
       context.write(key, outvalue);
    }
}

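Driver code. The driver parses the command-line arguments to get the input and output directories of the two jobs. It then calls a helper method, shown below, that configures and submits each job. The two Job objects are returned and monitored until the jobs complete. As long as either job is still running, the driver sleeps for another five seconds. Once both have completed, it checks each for success or failure and prints a corresponding log message. The exit code is then set according to whether both jobs succeeded.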

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path belowAvgInputDir = new Path(args[0]);
    Path aboveAvgInputDir = new Path(args[1]);
    Path belowAvgOutputDir = new Path(args[2]);
    Path aboveAvgOutputDir = new Path(args[3]);
    Job belowAvgJob = submitJob(conf, belowAvgInputDir, belowAvgOutputDir);
    Job aboveAvgJob = submitJob(conf, aboveAvgInputDir, aboveAvgOutputDir);
    // While both jobs are not finished, sleep
    while (!belowAvgJob.isComplete() || !aboveAvgJob.isComplete()) {
       Thread.sleep(5000);
    }
    if (belowAvgJob.isSuccessful()) {
       System.out.println("Below average job completed successfully!");
    } else {
       System.out.println("Below average job failed!");
    }
    if (aboveAvgJob.isSuccessful()) {
       System.out.println("Above average job completed successfully!");
    } else {
       System.out.println("Above average job failed!");
    }
    System.exit(belowAvgJob.isSuccessful() && aboveAvgJob.isSuccessful() ? 0: 1);
}

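The helper method configures each job. It looks quite standard, except that it uses Job.submit rather than Job.waitForCompletion. submit sends the job to the cluster and returns immediately, so the code that follows keeps running. As we saw, the returned Job is monitored in the main method until it completes.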

private static Job submitJob(Configuration conf, Path inputDir,
       Path outputDir) throws Exception {
    Job job = new Job(conf, "ParallelJobs");
    job.setJarByClass(ParallelJobs.class);
    job.setMapperClass(AverageReputationMapper.class);
    job.setReducerClass(AverageReputationReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(job, inputDir);
    job.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, outputDir);
    // Submit job and immediately return, rather than waiting for completion
    job.submit();
    return job;
}
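The submit-then-poll loop of the parallel driver can be tried out without a cluster. In the sketch below, a CompletableFuture stands in for a submitted Job and its isDone method plays the role of Job.isComplete(); the class name, the helper names, and the short sleep times are assumptions chosen so the example finishes quickly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BooleanSupplier;

class PollUntilComplete {
    // Sleeps between checks until every completion check returns true.
    // Returns how many times the loop had to sleep.
    static int waitForAll(List<BooleanSupplier> completionChecks, long pollMillis)
            throws InterruptedException {
        int polls = 0;
        while (!completionChecks.stream().allMatch(BooleanSupplier::getAsBoolean)) {
            polls++;
            Thread.sleep(pollMillis);
        }
        return polls;
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two background tasks started without blocking, like two submitted jobs
        CompletableFuture<Void> below = CompletableFuture.runAsync(() -> sleepQuietly(50));
        CompletableFuture<Void> above = CompletableFuture.runAsync(() -> sleepQuietly(120));

        List<BooleanSupplier> checks = new ArrayList<>();
        checks.add(below::isDone);
        checks.add(above::isDone);
        waitForAll(checks, 20);

        // Completion is not success: check the outcome of each task separately
        boolean allSucceeded = !below.isCompletedExceptionally()
                && !above.isCompletedExceptionally();
        System.out.println("all succeeded: " + allSucceeded);
    }
}
```

The final check mirrors the driver's use of isSuccessful() after isComplete(): a task that threw an exception is also "done".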

With Shell Scripting

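This approach is similar to using a master driver to launch the individual job drivers, except that the driver is written in a scripting language. Inside a shell script, each job in the chain can be launched separately, the same way it would be from the command line.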

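There are several major benefits and a couple of minor downsides. One benefit is that the job flow can be changed without recompiling, because the driver is a script rather than Java. This matters for jobs that are prone to failure, where you need to rerun or repair failed jobs by hand easily. You can also reuse jobs that are already in production and exposed through a command-line interface, even though they were not written to be driven by a script. Another benefit is that a shell script can interact with services, systems, and tools that are not written in Java. For example, the output post-processing discussed later in this chapter is very natural with sed or awk and much less so in Java.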

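Notice: Wrapping a MapReduce job in a script, whether it is a Java MapReduce job, a Pig job, or something else, has several benefits: post-processing, data flows, data preparation, additional logging, and so on.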

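In general, scripting is a good way to chain new jobs with existing ones quickly. For more robust applications, it usually makes more sense to build the chaining mechanism into a driver, which can interface with Hadoop better.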

Bash example

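In this example, a Bash script ties together the basic job chaining and parallel job examples. The script has two parts: setting the variables needed to run the jobs, and then running them.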

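Bash script. The inputs and outputs are stored in variables that are used to build several executable commands. There are two commands to run the two jobs, commands to cat the output to the screen, and a command to clean up the output.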

#!/bin/bash
JAR_FILE="mrdp.jar"
JOB_CHAIN_CLASS="mrdp.ch6.JobChainingDriver"
PARALLEL_JOB_CLASS="mrdp.ch6.ParallelJobs"
HADOOP="$( which hadoop )"
POST_INPUT="posts"
USER_INPUT="users"
JOBCHAIN_OUTDIR="jobchainout"
BELOW_AVG_INPUT="${JOBCHAIN_OUTDIR}/belowavg"
ABOVE_AVG_INPUT="${JOBCHAIN_OUTDIR}/aboveavg"
BELOW_AVG_REP_OUTPUT="belowavgrep"
ABOVE_AVG_REP_OUTPUT="aboveavgrep"
JOB_1_CMD="${HADOOP} jar ${JAR_FILE} ${JOB_CHAIN_CLASS} ${POST_INPUT} ${USER_INPUT} ${JOBCHAIN_OUTDIR}"
JOB_2_CMD="${HADOOP} jar ${JAR_FILE} ${PARALLEL_JOB_CLASS} ${BELOW_AVG_INPUT} ${ABOVE_AVG_INPUT} ${BELOW_AVG_REP_OUTPUT} ${ABOVE_AVG_REP_OUTPUT}"
CAT_BELOW_OUTPUT_CMD="${HADOOP} fs -cat ${BELOW_AVG_REP_OUTPUT}/part-*"
CAT_ABOVE_OUTPUT_CMD="${HADOOP} fs -cat ${ABOVE_AVG_REP_OUTPUT}/part-*"
RMR_CMD="${HADOOP} fs -rmr ${JOBCHAIN_OUTDIR} ${BELOW_AVG_REP_OUTPUT} ${ABOVE_AVG_REP_OUTPUT}"
LOG_FILE="avgrep_`date +%s`.txt"

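The next part of the script echoes each command before running it. It runs the first job and checks the return code for failure. On failure, it deletes the output directories and exits. On success, it runs the second job. If the second job completes successfully, the output of each job is written to the log file and all output is then deleted. The output does not need to be kept, since each final output file contains only one line, and keeping such a small result in the log file is better than keeping it in HDFS.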

{
    echo ${JOB_1_CMD}
    ${JOB_1_CMD}
    if [ $? -ne 0 ]
    then
        echo "First job failed!"
        echo ${RMR_CMD}
        ${RMR_CMD}
        # Exit nonzero so the failure is reported, not the status of the cleanup
        exit 1
    fi
    echo ${JOB_2_CMD}
    ${JOB_2_CMD}
    if [ $? -ne 0 ]
    then
        echo "Second job failed!"
        echo ${RMR_CMD}
        ${RMR_CMD}
        exit 1
    fi
    echo ${CAT_BELOW_OUTPUT_CMD}
    ${CAT_BELOW_OUTPUT_CMD}
    echo ${CAT_ABOVE_OUTPUT_CMD}
    ${CAT_ABOVE_OUTPUT_CMD}
    echo ${RMR_CMD}
    ${RMR_CMD}
    exit 0
} &> ${LOG_FILE}

Sample run. The output of a run is shown below, with some of the MapReduce messages omitted.

/home/mrdp/hadoop/bin/hadoop jar mrdp.jar mrdp.ch6.JobChainingDriver posts users jobchainout
12/06/10 15:57:43 INFO input.FileInputFormat: Total input paths to process : 5
12/06/10 15:57:43 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/06/10 15:57:43 WARN snappy.LoadSnappy: Snappy native library not loaded
12/06/10 15:57:44 INFO mapred.JobClient: Running job: job_201206031928_0065
...
12/06/10 15:59:14 INFO mapred.JobClient: Job complete: job_201206031928_0065
...
12/06/10 15:59:15 INFO mapred.JobClient: Running job: job_201206031928_0066
...
12/06/10 16:02:02 INFO mapred.JobClient: Job complete: job_201206031928_0066
/home/mrdp/hadoop/bin/hadoop jar mrdp.jar mrdp.ch6.ParallelJobs jobchainout/belowavg jobchainout/aboveavg belowavgrep aboveavgrep
12/06/10 16:02:08 INFO input.FileInputFormat: Total input paths to process : 1
12/06/10 16:02:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/06/10 16:02:08 WARN snappy.LoadSnappy: Snappy native library not loaded
12/06/10 16:02:12 INFO input.FileInputFormat: Total input paths to process : 1
Below average job completed successfully!
Above average job completed successfully!
/home/mrdp/hadoop/bin/hadoop fs -cat belowavgrep/part-*
Average Reputation: 275.36385831014724
/home/mrdp/hadoop/bin/hadoop fs -cat aboveavgrep/part-*
Average Reputation: 2375.301960784314
/home/mrdp/hadoop/bin/hadoop fs -rmr jobchainout belowavgrep aboveavgrep
Deleted hdfs://localhost:9000/user/mrdp/jobchainout
Deleted hdfs://localhost:9000/user/mrdp/belowavgrep
Deleted hdfs://localhost:9000/user/mrdp/aboveavgrep

With JobControl

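The JobControl and ControlledJob classes form a system for chaining MapReduce jobs. It has some nice features, such as tracking the state of the chain and automatically launching jobs once their dependencies are satisfied. JobControl is the right way to handle job chains, although it can be a bit heavyweight for simple applications.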

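To use JobControl, start by wrapping your jobs in ControlledJob. This is relatively simple: create the job as usual, then create a ControlledJob that takes your Job or Configuration as a parameter along with a list of its dependencies. Then add the ControlledJob objects one by one to the JobControl object.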

You still have to keep track of temporary data and clean it up at the end or on failure.
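The dependency behavior described above, where a job starts only after everything it depends on has succeeded, can be sketched in a few lines of plain Java. This is not the JobControl implementation: the Step class, the State enum, and the run method are hypothetical, the steps run one at a time in the calling thread, and a BooleanSupplier that returns true on success stands in for a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BooleanSupplier;

class DependencyChain {
    enum State { WAITING, SUCCESS, FAILED, DEPENDENT_FAILED }

    static final class Step {
        final String name;
        final BooleanSupplier work; // returns true on success
        final List<Step> dependsOn = new ArrayList<>();
        State state = State.WAITING;

        Step(String name, BooleanSupplier work) {
            this.name = name;
            this.work = work;
        }

        void addDependency(Step other) {
            dependsOn.add(other);
        }
    }

    // Runs every step whose dependencies succeeded; returns the names of failed steps
    static List<String> run(List<Step> steps) {
        List<String> failed = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Step s : steps) {
                if (s.state != State.WAITING) {
                    continue;
                }
                boolean blocked = false;
                boolean ready = true;
                for (Step d : s.dependsOn) {
                    if (d.state == State.FAILED || d.state == State.DEPENDENT_FAILED) {
                        blocked = true;
                    } else if (d.state != State.SUCCESS) {
                        ready = false;
                    }
                }
                if (blocked) {
                    // A dependency failed, so this step is never started
                    s.state = State.DEPENDENT_FAILED;
                    failed.add(s.name);
                    progressed = true;
                } else if (ready) {
                    s.state = s.work.getAsBoolean() ? State.SUCCESS : State.FAILED;
                    if (s.state == State.FAILED) {
                        failed.add(s.name);
                    }
                    progressed = true;
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Step binning = new Step("binning", () -> true);
        Step belowAvg = new Step("belowAvg", () -> true);
        Step aboveAvg = new Step("aboveAvg", () -> true);
        belowAvg.addDependency(binning);
        aboveAvg.addDependency(binning);
        List<String> failed = run(Arrays.asList(binning, belowAvg, aboveAvg));
        System.out.println(failed.isEmpty() ? "chain succeeded" : "failed: " + failed);
    }
}
```

A step whose dependency failed is reported as failed without ever being executed, which is the property that lets a driver stop a chain early.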

Job control example

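This example uses JobControl in the driver to combine the two previous examples, basic job chaining and parallel job chaining. We are already familiar with the mapper and reducer code, so it is not repeated here. The driver code that configures the jobs is the main thing shown. The driver submits the first job with basic job chaining and then uses JobControl to run the remaining job of the chain and the two parallel jobs. The initial job is not added to JobControl because its counters are needed to configure the second job, and that intermediate step breaks the control flow.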

With JobControl, every job must be fully configured before the chain starts running, which can be limiting.

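Main method. Let's look at the main method. It parses the command-line arguments and creates all the paths needed by the four jobs. Take care to name the variables so that the data flow is clear. The first job is then configured through a helper method and executed. When it completes, three ControlledJob objects are created through helper methods. Each helper method decides which mapper, reducer, and other settings the job uses.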

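binningControlledJob has no dependencies, other than the check that the previous job succeeded. The next two jobs both depend on binningControlledJob. They do not run until the binning job has completed successfully, and if it never succeeds they never run at all.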

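All three ControlledJob objects are added to the JobControl object, which is then run. The call to JobControl.run blocks until the group of jobs has completed. The driver then checks whether any job failed and sets the exit code accordingly. Before exiting, it deletes the intermediate output directories.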

 
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path postInput = new Path(args[0]);
    Path userInput = new Path(args[1]);
    Path countingOutput = new Path(args[3] + "_count");
    Path binningOutputRoot = new Path(args[3] + "_bins");
    Path binningOutputBelow = new Path(binningOutputRoot + "/"
           + JobChainingDriver.MULTIPLE_OUTPUTS_BELOW_NAME);
    Path binningOutputAbove = new Path(binningOutputRoot + "/"
           + JobChainingDriver.MULTIPLE_OUTPUTS_ABOVE_NAME);
    Path belowAverageRepOutput = new Path(args[2]);
    Path aboveAverageRepOutput = new Path(args[3]);
    Job countingJob = getCountingJob(conf, postInput, countingOutput);
    int code = 1;
    if (countingJob.waitForCompletion(true)) {
       ControlledJob binningControlledJob = new ControlledJob(
               getBinningJobConf(countingJob, conf, countingOutput,
                     userInput, binningOutputRoot));
       ControlledJob belowAvgControlledJob = new ControlledJob(
              getAverageJobConf(conf, binningOutputBelow,
                     belowAverageRepOutput));
       belowAvgControlledJob.addDependingJob(binningControlledJob);
       ControlledJob aboveAvgControlledJob = new ControlledJob(
              getAverageJobConf(conf, binningOutputAbove,
                     aboveAverageRepOutput));
       aboveAvgControlledJob.addDependingJob(binningControlledJob);
       JobControl jc = new JobControl("AverageReputation");
       jc.addJob(binningControlledJob);
       jc.addJob(belowAvgControlledJob);
       jc.addJob(aboveAvgControlledJob);
       jc.run();
       code = jc.getFailedJobList().size() == 0 ? 0 : 1;
    }
    FileSystem fs = FileSystem.get(conf);
    fs.delete(countingOutput, true);
    fs.delete(binningOutputRoot, true);
    System.exit(code);
}

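Helper methods. The helper methods below create the actual Job or Configuration objects; a ControlledJob can be created from either class. There are three separate methods, and the last one is used twice to create the two identical parallel jobs. The inputs and outputs differ for every job.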

public static Job getCountingJob(Configuration conf, Path postInput,
       Path outputDirIntermediate) throws IOException {
    // Setup first job to counter user posts
    Job countingJob = new Job(conf, "JobChaining-Counting");
    countingJob.setJarByClass(JobChainingDriver.class);
    // Set our mapper and reducer, we can use the API's long sum reducer for
    // a combiner!
    countingJob.setMapperClass(UserIdCountMapper.class);
    countingJob.setCombinerClass(LongSumReducer.class);
    countingJob.setReducerClass(UserIdSumReducer.class);
    countingJob.setOutputKeyClass(Text.class);
    countingJob.setOutputValueClass(LongWritable.class);
    countingJob.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(countingJob, postInput);
    countingJob.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(countingJob, outputDirIntermediate);
    return countingJob;
}
 
public static Configuration getBinningJobConf(Job countingJob,
       Configuration conf, Path jobchainOutdir, Path userInput,
       Path binningOutput) throws IOException {
    // Calculate the average posts per user by getting counter values
    double numRecords = (double) countingJob
           .getCounters()
           .findCounter(JobChainingDriver.AVERAGE_CALC_GROUP,
                  UserIdCountMapper.RECORDS_COUNTER_NAME).getValue();
    double numUsers = (double) countingJob
           .getCounters()
           .findCounter(JobChainingDriver.AVERAGE_CALC_GROUP,
                  UserIdSumReducer.USERS_COUNTER_NAME).getValue();
    double averagePostsPerUser = numRecords / numUsers;
    // Setup binning job
    Job binningJob = new Job(conf, "JobChaining-Binning");
    binningJob.setJarByClass(JobChainingDriver.class);
    // Set mapper and the average posts per user
    binningJob.setMapperClass(UserIdBinningMapper.class);
    UserIdBinningMapper.setAveragePostsPerUser(binningJob,
           averagePostsPerUser);
    binningJob.setNumReduceTasks(0);
    binningJob.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(binningJob, jobchainOutdir);
    // Add two named outputs for below/above average
    MultipleOutputs.addNamedOutput(binningJob,
           JobChainingDriver.MULTIPLE_OUTPUTS_BELOW_NAME,
           TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(binningJob,
           JobChainingDriver.MULTIPLE_OUTPUTS_ABOVE_NAME,
           TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.setCountersEnabled(binningJob, true);
    // Write the binned output under the binning output root
    TextOutputFormat.setOutputPath(binningJob, binningOutput);
    // Add the user files to the DistributedCache
    FileStatus[] userFiles = FileSystem.get(conf).listStatus(userInput);
    for (FileStatus status : userFiles) {
       DistributedCache.addCacheFile(status.getPath().toUri(),
              binningJob.getConfiguration());
    }
    // Return the configuration; the job is run later through a ControlledJob
    return binningJob.getConfiguration();
}
 
public static Configuration getAverageJobConf(Configuration conf,
       Path averageOutputDir, Path outputDir) throws IOException {
    Job averageJob = new Job(conf, "ParallelJobs");
    averageJob.setJarByClass(ParallelJobs.class);
    averageJob.setMapperClass(AverageReputationMapper.class);
    averageJob.setReducerClass(AverageReputationReducer.class);
    averageJob.setOutputKeyClass(Text.class);
    averageJob.setOutputValueClass(DoubleWritable.class);
    averageJob.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(averageJob, averageOutputDir);
    averageJob.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(averageJob, outputDir);
    // Return the configuration; the job is run later through a ControlledJob
    return averageJob.getConfiguration();
}
