NumPy and static linking

Problem description:

I am running Spark programs on a large cluster (for which I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error:

Traceback (most recent call last):
  File "/home/user/spark-script.py", line 12, in <module>
    import numpy
  File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
ImportError: cannot import name multiarray

The script is actually quite simple:

from pyspark import SparkConf, SparkContext
sc = SparkContext()

sc.addPyFile('numpy.zip')

import numpy

a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()

I understand that the error occurs because numpy dynamically loads the multiarray.so dependency, and even though my numpy.zip file includes the multiarray.so file, somehow the dynamic loading doesn't work with Apache Spark. Why is that? And how do you otherwise create a standalone numpy module with static linking?
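
For reference, the failure can be reproduced locally without Spark. A minimal sketch (assuming the same numpy.zip sits in the current directory): Python imports from archives on sys.path via zipimport, which can only execute pure-Python .py/.pyc files and has no loader for C extension modules, so multiarray.so inside the archive is never found.

import sys

# Adding the archive to sys.path makes Python import from it via
# zipimport, which handles .py/.pyc files only -- C extension
# modules like multiarray.so cannot be loaded out of a zip archive.
sys.path.insert(0, 'numpy.zip')

import numpy  # fails with: ImportError: cannot import name multiarray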

Thanks.

There are at least two problems with your approach, and both can be reduced to the simple fact that NumPy is a heavyweight dependency.

  • First of all, Debian packages come with multiple dependencies, including libgfortran, libblas, liblapack and libquadmath. So you cannot simply copy a NumPy installation and expect things to work (to be honest, you shouldn't do anything like this even if that weren't the case). Theoretically you could try to build it with static linking and ship it with all its dependencies that way, but that hits the second issue.

  • NumPy is pretty large by itself. While 20MB doesn't look particularly impressive, and with all the dependencies it shouldn't be more than 40MB, it has to be shipped to the workers each time you start your job. The more workers you have, the worse it gets. If you decide you need SciPy or SciKit, it can get much worse.

Arguably this makes NumPy a really bad candidate for being shipped with the addPyFile method.

If you didn't have direct access to the workers, but all the dependencies, including header files and a static library, were present, you could simply try to install NumPy in user space from the task itself (this assumes that pip is installed as well) with something like this:

try:
    import numpy as np
except ImportError:
    # NumPy is missing on this node; fall back to a user-space install
    import pip
    pip.main(["install", "--user", "numpy"])
    import numpy as np
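
For instance, the same fallback could be run from inside a task so that each worker bootstraps NumPy on first use. This is only a sketch under stated assumptions: ensure_numpy and the per-partition sum are hypothetical placeholders, and pip must be present on every worker.

from pyspark import SparkContext

sc = SparkContext()

def ensure_numpy(iterator):
    # Import-or-install fallback executed on the worker that runs
    # this partition; subsequent tasks find NumPy already installed.
    try:
        import numpy as np
    except ImportError:
        import pip  # assumes pip is available on the worker
        pip.main(["install", "--user", "numpy"])
        import numpy as np
    yield np.sum(list(iterator))  # hypothetical per-partition work

print sc.parallelize(range(100), 4).mapPartitions(ensure_numpy).collect()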

You will find other variants of this approach in How to install and import Python modules at runtime?

Since you have access to the workers, a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda, which can be used to package non-Python dependencies as well and doesn't depend on system-wide libraries. You can easily automate this task using tools like Ansible or Fabric; it doesn't require administrative privileges, and all you really need is bash and some way to fetch basic installers (wget, curl, rsync, scp).
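
As a rough sketch, you would then point Spark at that per-node interpreter before creating the context. The environment path here is hypothetical, it assumes an identical conda environment has already been provisioned at that path on every node, and depending on the cluster manager you may have to set the variable in spark-env.sh on the workers instead:

import os

# Hypothetical path: a conda environment pushed to every node
# (e.g. with Ansible or Fabric); no administrative rights needed.
os.environ["PYSPARK_PYTHON"] = "/home/user/anaconda2/envs/spark-env/bin/python"

from pyspark import SparkConf, SparkContext
sc = SparkContext()

import numpy  # resolved from the conda environment on each worker
print sc.parallelize(numpy.array([12, 23, 34])).collect()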

See also: Shipping Python modules in pyspark to other nodes