Dataflow job fails after a few 410 errors (while writing to GCS)

Problem description:

I found a similar SO question from August that describes more or less what my team has been experiencing recently with our Dataflow pipelines: How to recover from Cloud Dataflow job failed on com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone

Here is the exception (a few 410 exceptions were thrown over a span of ~1 hour, but I am pasting only the last one):

(9f012f4bc185d790): java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "Backend Error",
    "reason" : "backendError"
  } ],
  "message" : "Backend Error"
}
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
    at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
    at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:97)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:287)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:223)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: java.nio.channels.ClosedChannelException
        at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.throwIfNotOpen(AbstractGoogleAsyncWriteChannel.java:408)
        at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:286)
        at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
        at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.abort(WriteOperation.java:112)
        at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:86)
        ... 10 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
  "code" : 500,
  "errors" : [ {
    "domain" : "global",
    "message" : "Backend Error",
    "reason" : "backendError"
  } ],
  "message" : "Backend Error"
}
    at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
    ... 4 more
 2016-12-18 (18:58:58) Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/D...
(d3b59c20088d726e): Workflow failed. Causes: (8e88b50b0d86156a): S26:up:2016-12-18:userprofile-lifetime-20161218/Write/DataflowPipelineRunner.BatchWrite/Finalize failed., (2f2e396b6ba3aaa2): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on: userprofile-lifetime-diff-12181745-c04d-harness-0xq5, userprofile-lifetime-diff-12181745-c04d-harness-adga, userprofile-lifetime-diff-12181745-c04d-harness-cz30, userprofile-lifetime-diff-12181745-c04d-harness-lt9m

Here is the job id: 2016-12-18_17_45_23-2873252511714037422

Based on the answer to the other SO question I mentioned above, I am re-running the same job with the number of shards explicitly specified (set to 4000, since this job runs daily and normally outputs ~4k files). Is there a reason why limiting the number of shards to below 10k helps? Knowing this would be useful if we need to redesign our pipeline.
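
For reference, pinning the shard count on the write looks roughly like this -- a minimal sketch assuming the Dataflow Java SDK 1.x TextIO sink; the bucket paths and class name are hypothetical, not our actual pipeline:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class FixedShardWrite {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    Pipeline p = Pipeline.create(options);

    // Hypothetical input path; stands in for the real pipeline's source.
    PCollection<String> lines = p.apply(TextIO.Read.from("gs://my-bucket/input/*"));

    // Pin the output to 4000 shards, matching the ~4k files the job
    // normally produces when Dataflow chooses the sharding itself.
    lines.apply(TextIO.Write
        .to("gs://my-bucket/output/userprofile-lifetime") // hypothetical path
        .withSuffix(".txt")
        .withNumShards(4000));

    p.run();
  }
}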

Also, when the number of shards is specified, the job takes much longer than it does without it, mainly because of the step that writes to GCS. In dollar terms, this job usually costs $75-80 (and we run it daily), whereas it cost $130-$140 (about a 74% increase) with the number of shards specified; the other steps seem to have run for roughly the same duration (the job id is 2016-12-18_19_30_32-7274262445792076535). So we would really like to avoid specifying the number of shards, if possible.

Any help and suggestions would be greatly appreciated!

-- Follow-up: The output of this job seems to be disappearing and then reappearing in GCS when I run 'gsutil ls' on the output directory, even 10+ hours after the job completed. This may be a related issue, but I created a separate question for it ("gsutil ls" shows a different list every time).

Yes -- specifying the number of shards enforces additional constraints on how Dataflow executes the job and may impact performance. For example, dynamic work rebalancing is incompatible with a fixed number of shards.
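
For contrast, a minimal sketch (again assuming the Dataflow Java SDK 1.x TextIO sink, with a hypothetical path) of a write that leaves the sharding to the service:

import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

class UnshardedWrite {
  // Omitting withNumShards(N) lets the Dataflow service choose the
  // sharding, which keeps dynamic work rebalancing available for this
  // write stage; a fixed shard count would disable it.
  static void write(PCollection<String> lines) {
    lines.apply(TextIO.Write
        .to("gs://my-bucket/output/part") // hypothetical path
        .withSuffix(".txt"));
  }
}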

As far as I understand, the 410 Gone error is a transient GCS issue, and Dataflow's retry mechanism can generally work around it. However, if it occurs at too high a rate, it can fail the job: in batch mode, Dataflow fails the job if a single bundle fails 4 times.
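
To illustrate the semantics (this is a hypothetical helper, not Dataflow's actual retry code): transient backend errors get retried with backoff, and the failure only surfaces once the attempt budget is exhausted.

import com.google.api.client.googleapis.json.GoogleJsonResponseException;
import java.util.concurrent.Callable;

class TransientRetry {
  // Illustrative only: retries a call on transient GCS backend errors
  // (410 Gone / 5xx) with exponential backoff, rethrowing once the
  // attempt budget is spent -- analogous to a bundle failing 4 times.
  static <T> T withRetries(Callable<T> call, int maxAttempts) throws Exception {
    long backoffMillis = 1000;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (GoogleJsonResponseException e) {
        boolean isTransient = e.getStatusCode() == 410 || e.getStatusCode() / 100 == 5;
        if (!isTransient || attempt >= maxAttempts) {
          throw e; // permanent error, or out of attempts: fail the work item
        }
        Thread.sleep(backoffMillis);
        backoffMillis *= 2; // simple exponential backoff
      }
    }
  }
}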