Default pip install of Dask gives "ImportError: No module named toolz"
I installed Dask using pip like this:
pip install dask
and when I try to do import dask.dataframe as dd I get the following error message:
>>> import dask.dataframe as dd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/to/venv/lib/python2.7/site-packages/dask/__init__.py", line 5, in <module>
from .async import get_sync as get
File "/path/to/venv/lib/python2.7/site-packages/dask/async.py", line 120, in <module>
from toolz import identity
ImportError: No module named toolz
I noticed that the documentation states
pip install dask: Install only dask, which depends only on the standard library. This is appropriate if you only want the task schedulers.
so I'm confused as to why this didn't work.
In order to use Dask's parallelized dataframes (built on top of pandas), you have to tell pip to install some "extras", as mentioned in the Dask installation documentation:
pip install "dask[dataframe]"
Or you could just do
pip install "dask[complete]"
to get the whole bag of tricks. NB: the double quotes may or may not be required in your shell.
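If you install from a script rather than from an interactive shell, the quoting question goes away, because the extras specifier reaches pip as a single argument with no shell in between. Below is a minimal sketch, assuming you want to install into the same interpreter that runs the script. The helper names build_pip_command and install_dask are made up for illustration and are not part of dask or pip.

```python
import subprocess
import sys


def build_pip_command(extra="dataframe"):
    """Return the argument list for installing dask with one extra.

    Hypothetical helper for illustration; not part of dask or pip.
    """
    # "python -m pip" targets the interpreter that is running this script.
    requirement = "dask[{}]".format(extra)
    return [sys.executable, "-m", "pip", "install", requirement]


def install_dask(extra="dataframe"):
    """Run pip with the command built above (hypothetical helper)."""
    # The arguments are passed as a list, so no shell parses them and
    # the square brackets need no quoting.
    subprocess.check_call(build_pip_command(extra))
```

Calling install_dask("complete") would request the "complete" extra instead. Nothing is installed until one of these functions is called explicitly.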
The justification for this is (or was) mentioned in the Dask documentation:
We do this so that users of the lightweight core dask scheduler aren't required to download the more exotic dependencies of the collections (numpy, pandas, etc.)
As mentioned in Obinna's answer, you may wish to do this inside a virtualenv, or use pip install --user to put the libraries in your home directory if, say, you don't have admin privileges on the host OS.
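To see which of these situations you are in, you can ask the interpreter itself. The following is a rough sketch using only the standard library. The function names in_virtualenv and describe_environment are invented here, and the sys.real_prefix check is an assumption meant to cover older virtualenv releases, which set that attribute instead of sys.base_prefix.

```python
import site
import sys


def in_virtualenv():
    """Best-effort check for an active virtualenv or venv."""
    # Assumption: older virtualenv releases set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def describe_environment():
    """Collect the locations that matter when choosing where pip installs."""
    # Some old virtualenv site modules lack getusersitepackages, so guard it.
    get_user_site = getattr(site, "getusersitepackages", None)
    return {
        "in_virtualenv": in_virtualenv(),
        "prefix": sys.prefix,
        "user_site": get_user_site() if get_user_site else None,
    }


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("{}: {}".format(key, value))
```

The user_site entry is the directory that pip install --user would normally target for this interpreter.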
In Dask 0.13.0 and below, dask/async.py required toolz's identity function. There is a closed pull request associated with GitHub issue #1849 to remove this dependency. If, for some reason, you are stuck with an older version of dask, you can work around that particular issue by simply doing pip install toolz.
But this wouldn't (completely) fix your problem with import dask.dataframe as dd anyway, because you'd still get this error:
>>> import dask.dataframe as dd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/staff_agbio/PhyloWeb/data/dask-test/venv/local/lib/python2.7/site-packages/dask/dataframe/__init__.py", line 3, in <module>
from .core import (DataFrame, Series, Index, _Frame, map_partitions,
File "/data/staff_agbio/PhyloWeb/data/dask-test/venv/local/lib/python2.7/site-packages/dask/dataframe/core.py", line 12, in <module>
import pandas as pd
ImportError: No module named pandas
or, if you had pandas installed already, you'd get ImportError: No module named cloudpickle. So pip install "dask[dataframe]" seems to be the way to go if you're in this situation.
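Before reinstalling, it can help to check which of the modules named in the tracebacks are actually missing from your environment. Here is a small sketch, assuming Python 3.4 or later for importlib.util.find_spec (the tracebacks in the question come from Python 2.7, where that function does not exist). The list covers only the three modules mentioned in this thread and is not the full dependency list of dask[dataframe]. The name missing_modules is made up for illustration.

```python
import importlib.util

# Only the modules named in the error messages above; an assumption,
# not the complete set of dependencies for dask[dataframe].
MODULES = ("toolz", "pandas", "cloudpickle")


def missing_modules(names=MODULES):
    """Return the top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

If the list comes back non-empty, installing the "dataframe" extra as described above is the simpler fix than adding the modules one by one.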