Error 1121 when importing an external library in a Jython Pig UDF

Problem description:

I'm having a problem using the Python library simplejson in Jython to write a Pig UDF. I need it because jython-standalone-2.5.2.jar doesn't come with a JSON library. I'm using Apache Pig version 0.11.0-cdh4.4.0 (rexported), compiled Sep 03 2013, 20:25:46. According to the documentation at http://pig.apache.org/docs/r0.11.1/udf.html#python-advanced: "You can import Python modules in your Python script. Pig resolves Python dependencies recursively, which means Pig will automatically ship all dependent Python modules to the backend. Python modules should be found in the jython search path: JYTHON_HOME, JYTHON_PATH, or current directory." So I downloaded the library from https://pypi.python.org/pypi/simplejson/ and unzipped it in my working directory, and my script works in local mode (with -x local). In cluster mode, however, I get this error in the failed task logs of the task tracker:

Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1121: Python Error. Traceback (most recent call last):
  File "ejercicio4-udfs.py", line 8, in <module>
ImportError: No module named simplejson

    at org.apache.pig.scripting.jython.JythonScriptEngine$Interpreter.execfile(JythonScriptEngine.java:231)
    at org.apache.pig.scripting.jython.JythonScriptEngine$Interpreter.init(JythonScriptEngine.java:158)
    at org.apache.pig.scripting.jython.JythonScriptEngine.getFunction(JythonScriptEngine.java:349)
    at org.apache.pig.scripting.jython.JythonFunction.<init>(JythonFunction.java:55)
    ... 92 more
Caused by: Traceback (most recent call last):
  File "ejercicio4-udfs.py", line 8, in <module>
ImportError: No module named simplejson

I've tried several things, such as zipping simplejson, registering the zip, and trying to access it with sys.path.append('simplejson.zip'). I've also tried:

export JYTHONPATH=$JYTHONPATH:$(pwd)/simplejson.zip; pig script.pig

and also:

pig -Dmapred.cache.files="simplejson.zip#simplejson.zip" -Dmapred.create.symlink=yes script.pig
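For reference, in standard Python a zip file added to sys.path is only importable when the package directory sits at the root of the archive, and a relative entry such as 'simplejson.zip' is resolved against the process's current working directory. The sketch below illustrates that layout with a throwaway package; demo_pkg and the temporary paths are hypothetical stand-ins, and it does not show how Pig ships files to the backend.

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip whose ROOT contains the package directory.
# "demo_pkg" is a hypothetical stand-in for simplejson.
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, 'demo_pkg.zip')

archive = zipfile.ZipFile(zip_path, 'w')
archive.writestr('demo_pkg/__init__.py', 'VALUE = 42\n')
archive.close()

# Use the absolute path: a relative entry would depend on the
# current working directory of the process doing the import.
sys.path.append(zip_path)

import demo_pkg
```

If the archive instead wrapped the package in an extra top-level folder (for example simplejson-x.y/simplejson/), the import would fail with the same "No module named" error.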

I don't know if my answer comes too late, but I managed to import simplejson in a UDF.

Here is how I did it: I downloaded simplejson and put it into a lib folder, then in my UDF I did this:

import sys
sys.path.append('/path/to/your/lib/folder')
import simplejson as json

I was then able to call json.loads() on my cluster without any problems.
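A slightly more defensive version of the same idea might look like the sketch below. It assumes the lib folder exists at the same absolute path on every node that runs the UDF (presumably why an absolute path works in cluster mode); LIB_DIR and add_to_search_path are hypothetical names, and the fallback import only helps on interpreters that bundle a json module.

```python
import os
import sys

# Hypothetical location: the folder that contains the unpacked
# "simplejson" package directory. It is assumed to be readable at
# this same path on every node that executes the UDF.
LIB_DIR = '/path/to/your/lib/folder'


def add_to_search_path(path):
    """Append path to sys.path once; return True if the directory exists."""
    if path not in sys.path:
        sys.path.append(path)
    return os.path.isdir(path)


# False here means the import below will most likely fail on this node.
lib_found = add_to_search_path(LIB_DIR)

try:
    import simplejson as json
except ImportError:
    # Fallback for interpreters that ship a json module. Where none
    # is bundled, this raises ImportError again.
    import json
```

Checking lib_found in the task logs is one way to tell a missing folder on a worker node apart from a problem inside the package itself.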

Hope this helps.