Google DataFlow/Python: import errors with save_main_session and custom modules imported in __main__

Problem:

Could somebody please clarify the expected behavior when using save_main_session with custom modules imported in __main__? My DataFlow pipeline imports two non-standard modules: one via requirements.txt and the other via setup_file. Unless I move the imports into the functions where they are used, I keep getting import/pickling errors; a sample error is below. From the documentation, I assumed that setting save_main_session would solve this problem, but it does not (see the error below). So I wonder whether I missed something or this behavior is by design. The same import works fine when placed inside a function.
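
For reference, here is a minimal sketch of the failing pattern (the element shape and pipeline body are illustrative, not my actual pipeline):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    import jmespath  # custom module imported at module level in __main__

    def extract_name(record):
        # Uses the module-level import; on the Dataflow worker this is where
        # "ImportError: No module named jmespath" surfaces.
        return jmespath.search('payload.name', record)

    def run():
        # save_main_session should pickle the state of __main__ (including
        # the jmespath import) and restore it on the workers.
        options = PipelineOptions(save_main_session=True)
        with beam.Pipeline(options=options) as p:
            (p
             | beam.Create([{'payload': {'name': 'a'}}])
             | beam.Map(extract_name))

    if __name__ == '__main__':
        run()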

Error:


  File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named jmespath
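
For comparison, moving the import into the function, as mentioned above, works:

    def extract_name(record):
        # Deferring the import to the function body means it runs on the
        # worker after jmespath has been installed there, so it succeeds.
        import jmespath
        return jmespath.search('payload.name', record)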

"> https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

When to use --save_main_session:

You can set the --save_main_session pipeline option to True. This will cause the state of the global namespace to be pickled and loaded on the Cloud Dataflow worker.
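
For reference, the option can be passed on the command line or set programmatically (the pipeline file name and GCP values below are placeholders):

    # Command line (placeholder project/bucket values):
    #   python my_pipeline.py --runner DataflowRunner --save_main_session \
    #       --project my-project --temp_location gs://my-bucket/tmp

    # Programmatically:
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions()
    options.view_as(SetupOptions).save_main_session = True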

The setup that works best for me is having a dataflow_launcher.py sitting at the project root with your setup.py. The only thing it does is import your pipeline file and launch it. Use setup.py to handle all your dependencies. This is the best example I've found so far (a sketch of the layout follows the link):

https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset
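
Roughly, the layout looks like this (my_package and pipeline.py are hypothetical names; only dataflow_launcher.py and setup.py come from the description above):

    project/
        dataflow_launcher.py    # entry point: imports the pipeline and runs it
        setup.py                # declares every dependency (e.g. jmespath)
        my_package/
            __init__.py
            pipeline.py         # pipeline construction and DoFns

    # setup.py
    import setuptools

    setuptools.setup(
        name='my_package',
        version='0.0.1',
        install_requires=['jmespath'],  # installed on every worker
        packages=setuptools.find_packages(),
    )

    # dataflow_launcher.py
    from my_package import pipeline

    if __name__ == '__main__':
        # Launch with --setup_file ./setup.py so Dataflow builds and installs
        # my_package (plus its dependencies) on the workers.
        pipeline.run()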