Summary of Installing apex
1. CUDA version mismatch
2. Installing nvcc
3. The “Given no hashes to check XXX links for project 'pip': discarding no candidates” error
5. Follow-up

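I recently used a library that depends on apex, and it took me a whole morning to get apex installed. I am recording the process here for anyone who runs into the same problems.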

Environment:
System: Windows

Library: PyTorch 1.9.0
CUDA version: 11.1

VS: 2019

A note on VS: besides VS itself and the default recommended C++ components, I also installed some extra components on the fly while troubleshooting, and I did not restart the computer afterwards. In theory this should have nothing to do with the apex installation. I record it here because it happened during the process.

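1. CUDA version mismatch

The library recommends PyTorch 1.7.1 with CUDA 10.2. I installed apex following the library's instructions and got an error saying the CUDA versions did not match.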

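I opened “apex/setup.py” and read the code. It requires two versions to match: the CUDA version PyTorch was compiled with (torch_binary_major, torch_binary_minor) and the version of the locally installed CUDA toolkit as reported by nvcc (bare_metal_major, bare_metal_minor).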

```python
# Excerpt from apex/setup.py
import subprocess

import torch


def get_cuda_bare_metal_version(cuda_dir):
    # Run the toolkit's nvcc; its output contains e.g. "release 11.1, V11.1.105"
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
    output = raw_output.split()
    release_idx = output.index("release") + 1
    release = output[release_idx].split(".")   # "11.1," -> ["11", "1,"]
    bare_metal_major = release[0]
    bare_metal_minor = release[1][0]           # first character only, drops the ","

    return raw_output, bare_metal_major, bare_metal_minor


def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    raw_output, bare_metal_major, bare_metal_minor = get_cuda_bare_metal_version(cuda_dir)
    # CUDA version the installed PyTorch binaries were built with, e.g. "11.1"
    torch_binary_major = torch.version.cuda.split(".")[0]
    torch_binary_minor = torch.version.cuda.split(".")[1]

    print("\nCompiling cuda extensions with")
    print(raw_output + "from " + cuda_dir + "/bin\n")

    # Both major and minor must match, otherwise the build is aborted
    if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
        raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                           "not match the version used to compile Pytorch binaries.  " +
                           "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
                           "In some cases, a minor-version mismatch will not cause later errors:  " +
                           "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
                           "You can try commenting out this check (at your own risk).")
```

Solution: make CUDA and PyTorch consistent by adapting one to the other. Also, judging from setup.py, the CUDA version must be newer than 10.0.
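You can reproduce the check above before running setup.py at all. Below is a minimal sketch, not part of apex. The function names are hypothetical, and it uses a regular expression over the `nvcc -V` text instead of apex's whitespace split. It compares the major.minor of `torch.version.cuda` with the release reported by the `nvcc` found on PATH. Note that apex itself runs the `nvcc` under `CUDA_HOME`, which is not necessarily the same binary.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(nvcc_output: str) -> Tuple[int, int]:
    """Return (major, minor) from `nvcc -V` text containing e.g. 'release 11.1, V11.1.105'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(torch_cuda: str) -> Tuple[int, int]:
    """Return (major, minor) from a string such as torch.version.cuda ('11.1')."""
    major, minor = torch_cuda.split(".")[:2]
    return int(major), int(minor)


def cuda_versions_match(torch_cuda: str, nvcc_output: str) -> bool:
    """True when major and minor agree, the condition apex's setup.py enforces."""
    return parse_torch_cuda(torch_cuda) == parse_nvcc_release(nvcc_output)


def read_nvcc_output() -> Optional[str]:
    """Run the nvcc found on PATH (not necessarily CUDA_HOME's) or return None."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    return subprocess.check_output([nvcc, "-V"], universal_newlines=True)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    nvcc_text = read_nvcc_output()
    if torch is None or torch.version.cuda is None or nvcc_text is None:
        print("Need a CUDA build of PyTorch and nvcc on PATH to compare.")
    else:
        print("PyTorch built with CUDA", torch.version.cuda)
        print("Versions match:", cuda_versions_match(torch.version.cuda, nvcc_text))
```

If the script prints `False`, either install the PyTorch wheel built for your CUDA version or install the CUDA toolkit that matches PyTorch, as described above.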

Final choice:

Python: 3.7

PyTorch install command: “”

2. Installing nvcc

I opened a cmd window and typed “nvcc -V”. It reported that nvcc is not a recognized command.

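Steps I took:

1. Reinstalled CUDA 11.1 with the Custom option, deselecting everything else and checking only nvcc.
2. Added nvcc's directory to the system PATH.
3. A program was still running and I did not want to reboot, so I used a command found online to refresh PATH instead.

After that, “nvcc -V” worked normally in cmd.

I suspect this is where the trap was laid. I did not reboot after installing, and that may be why the later installation kept failing until I finally restarted.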
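This fits the reboot suspicion: a process inherits its environment when it starts. A cmd window or IDE that was already open before PATH was edited keeps the old PATH. Below is a small diagnostic (hypothetical helper, not from any library). It shows where `nvcc` resolves in the current process and what `CUDA_PATH` and `CUDA_HOME` contain. The Windows CUDA installer normally sets `CUDA_PATH`. PyTorch's `torch.utils.cpp_extension`, which apex's setup.py relies on, checks `CUDA_HOME` first, then `CUDA_PATH`, then the location of `nvcc`.

```python
import os
import shutil
from typing import Dict, Optional


def cuda_env_report(env: Optional[Dict[str, str]] = None) -> Dict[str, Optional[str]]:
    """Report where `nvcc` resolves and which CUDA-related variables are set.

    `env` defaults to this process's environment, which was fixed when the
    process started: a window opened before PATH was edited won't see the change.
    """
    env = dict(os.environ) if env is None else env
    return {
        "nvcc": shutil.which("nvcc", path=env.get("PATH", "")),
        "CUDA_PATH": env.get("CUDA_PATH"),
        "CUDA_HOME": env.get("CUDA_HOME"),
    }


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key:10} {value or '(not found)'}")
```

Run it from the same window you will use for `python setup.py install`. If `nvcc` shows `(not found)` there but works in a freshly opened window, that window is still carrying the old environment.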


3. The “Given no hashes to check XXX links for project 'pip': discarding no candidates” error

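The installation kept getting stuck at this message. I tried the following:

1) First, I opened “apex/requirements.txt” and “apex/requirements_dev.txt”, compared them against `conda list`, and installed the missing packages.

2) Next, “https://blog.csdn.net/qq_33019383/article/details/103990248” says to install torch-scatter, so I installed it.
3) Advice found online was to delete the previously downloaded “C:\Users\Administrator\apex” folder and run the following commands again: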

```
git clone https://www.github.com/nvidia/apex
cd apex
python3 setup.py install
```

Unfortunately, none of the above worked.
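Step 1) above, comparing the requirements files with `conda list` by hand, can be scripted. Below is a minimal sketch with hypothetical helper names. It assumes simple `name==version`-style lines and skips options and URL requirements. It looks up installed distributions through `importlib.metadata`, which on Python 3.7 (the version used here) requires the `importlib-metadata` backport.

```python
import re
from typing import Iterable, List

try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7: needs `pip install importlib-metadata`
    from importlib_metadata import PackageNotFoundError, version


def requirement_names(lines: Iterable[str]) -> List[str]:
    """Extract bare project names from requirements.txt-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-") or "://" in line:
            continue  # blanks, comments, options like -r/-e, URL requirements
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_requirements(lines: Iterable[str]) -> List[str]:
    """Return the names from `lines` that are not installed in this environment."""
    missing = []
    for name in requirement_names(lines):
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:  # e.g. requirements.txt requirements_dev.txt
        with open(path, encoding="utf-8") as handle:
            print(path, "missing:", missing_requirements(handle) or "none")
```

Run it inside the activated conda environment, passing both files, e.g. `python check_reqs.py requirements.txt requirements_dev.txt`. The script name is only an example.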
4. Final fix

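Reboot the computer. The library mentioned earlier also depends on a few other packages, so I installed them at the same time:

```
pip install opencv-python==4.4.0.46 termcolor==1.1.0 yacs==0.1.8 diffdist
```

Then run:

```
cd apex
python3 setup.py install
```

There were warnings, but the installation succeeded.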

```
torch.__version__  = 1.9.0


setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
  warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
running install
running bdist_egg
running egg_info
writing apex.egg-info\PKG-INFO
writing dependency_links to apex.egg-info\dependency_links.txt
writing top-level names to apex.egg-info\top_level.txt
reading manifest file 'apex.egg-info\SOURCES.txt'
writing manifest file 'apex.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
creating build\lib
creating build\lib\apex
copying apex\__init__.py -> build\lib\apex
creating build\lib\apex\amp
copying apex\amp\amp.py -> build\lib\apex\amp
copying apex\amp\compat.py -> build\lib\apex\amp
……
copying build\lib\apex\pyprof\nvtx\__init__.py -> build\bdist.win-amd64\egg\apex\pyprof\nvtx
creating build\bdist.win-amd64\egg\apex\pyprof\parse
copying build\lib\apex\pyprof\parse\db.py -> build\bdist.win-amd64\egg\apex\pyprof\parse
……
copying build\lib\apex\RNN\__init__.py -> build\bdist.win-amd64\egg\apex\RNN
copying build\lib\apex\__init__.py -> build\bdist.win-amd64\egg\apex
byte-compiling build\bdist.win-amd64\egg\apex\amp\amp.py to amp.cpython-37.pyc
……
byte-compiling build\bdist.win-amd64\egg\apex\RNN\RNNBackend.py to RNNBackend.cpython-37.pyc
byte-compiling build\bdist.win-amd64\egg\apex\RNN\__init__.py to __init__.cpython-37.pyc
byte-compiling build\bdist.win-amd64\egg\apex\__init__.py to __init__.cpython-37.pyc
creating build\bdist.win-amd64\egg\EGG-INFO
copying apex.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying apex.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying apex.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying apex.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
zip_safe flag not set; analyzing archive contents...
apex.pyprof.nvtx.__pycache__.nvmarker.cpython-37: module references __file__
apex.pyprof.nvtx.__pycache__.nvmarker.cpython-37: module references __path__
creating dist
creating 'dist\apex-0.1-py3.7.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing apex-0.1-py3.7.egg
creating c:\programdata\anaconda3\envs\XXXX\lib\site-packages\apex-0.1-py3.7.egg
Extracting apex-0.1-py3.7.egg to c:\programdata\anaconda3\envs\XXXX\lib\site-packages
Adding apex 0.1 to easy-install.pth file

Installed c:\programdata\anaconda3\envs\XXXX\lib\site-packages\apex-0.1-py3.7.egg
Processing dependencies for apex==0.1
Finished processing dependencies for apex==0.1
```

5. Follow-up

1)

Later I found that the precision-setting code raised errors, so apex had not actually been installed successfully.

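Running the install command again had no visible effect:

```
python setup.py install
```

It simply returned to a new prompt without printing anything.

I switched to:

```
python setup.py build
pip install -v --no-cache-dir
```

Result: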

```
torch.__version__  = 1.9.0


setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
  warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

running bdist_wheel
running build
running build_py
installing to build\bdist.win-amd64\wheel
running install
running install_lib
……
adding 'apex-0.1.dist-info/WHEEL'
adding 'apex-0.1.dist-info/top_level.txt'
adding 'apex-0.1.dist-info/RECORD'
removing build\bdist.win-amd64\wheel
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\pytorch1.8.1\lib\site-packages\colorama\ansitowin32.py", line 59, in closed
    return stream.closed
ValueError: underlying buffer has been detached
done
Created wheel for apex: filename=apex-0.1-py3-none-any.whl size=206058 sha256=8761f64146164553df82742b07c5ef2cfe9da3a82a636b9457483cb95a9544ba
Stored in directory: C:\Users\Administrator\AppData\Local\Temp\pip-ephem-wheel-cache-8l21lyri\wheels\17\e2\d0\fbd642567ec1ec2e05aa8db3ae5d45c586c0f909da3f40de6e
Successfully built apex
Installing collected packages: apex


Successfully installed apex-0.1
1 location(s) to search for versions of pip:
* https://pypi.org/simple/pip/
Fetching project page and analyzing links: https://pypi.org/simple/pip/
Getting page https://pypi.org/simple/pip/
Found index url https://pypi.org/simple
Starting new HTTPS connection (1): pypi.org:443
https://pypi.org:443 "GET /simple/pip/ HTTP/1.1" 200 16538
……
Found link https://files.pythonhosted.org/packages/b1/44/6e26d5296b83c6aac166e48470d57a00d3ed1f642e89adc4a4e412a01643/pip-21.1.2.tar.gz#sha256=eb5df6b9ab0af50fe1098a52fd439b04730b6e066887ff7497357b9ebd19f79b (from https://pypi.org/simple/pip/) (requires-python:>=3.6), version: 21.1.2
Skipping link: not a file: https://pypi.org/simple/pip/
Given no hashes to check 167 links for project 'pip': discarding no candidates
Removed build tracker: 'C:\Users\Administrator\AppData\Local\Temp\pip-req-tracker-hs8z7jdp'
```


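“Successfully installed apex-0.1” reports that the installation succeeded. Note, however, that these commands do not build the CUDA extension or the C++ extension. As soon as code uses those parts, problems appear.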
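Whether the extensions are built depends on how apex's setup.py is invoked. Two details of the commands above matter here:

- The `pip install -v --no-cache-dir` line has no target. pip needs a requirement or a path, so the sketch below appends a `source_dir`. Defaulting it to `.` (run from inside the apex checkout) is an assumption.
- `--cpp_ext` and `--cuda_ext` are the options apex's setup.py of that period checked before compiling its C++ and CUDA extensions. Its README passed them through `--global-option`. Newer pip releases deprecate `--global-option`, so check apex's current README before copying this.

Separately, `--disable-pip-version-check` skips pip's lookup of its own latest version on PyPI. With `-v`, that lookup appears to be what prints the `versions of pip` ... `Given no hashes to check 167 links for project 'pip'` lines at the end of the log above. If so, those lines are informational rather than the cause of the failure. The helper name is hypothetical:

```python
import sys
from typing import List


def apex_pip_command(source_dir: str = ".", with_extensions: bool = True) -> List[str]:
    """Build (but do not run) a pip command that installs apex from a local checkout."""
    cmd = [sys.executable, "-m", "pip", "install", "-v",
           "--disable-pip-version-check",  # skip pip's PyPI lookup for its own version
           "--no-cache-dir"]
    if with_extensions:
        # Forwarded to apex's setup.py, which compiles the C++ and CUDA
        # extensions only when these flags are present.
        cmd += ["--global-option=--cpp_ext", "--global-option=--cuda_ext"]
    cmd.append(source_dir)
    return cmd


if __name__ == "__main__":
    # Print the command; pass it to subprocess.check_call(cmd, cwd=<apex checkout>) to run it.
    print(" ".join(apex_pip_command()))
```

With `with_extensions=False` this reduces to a Python-only build like the one above. The extension build additionally needs the matching CUDA toolkit and the VS C++ build tools from the earlier sections.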

For example, running the Swin Transformer example pops up a warning that “amp_C” cannot be found. This has a knock-on effect:

```python
torch.distributed.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
```

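This call raises a warning and actually fails, so distributed initialization never completes. As a result, I had to comment out by hand all the later distributed-related code (the sampler, and setting the epoch during training).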
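Separately from apex, PyTorch's Windows builds do not include NCCL. Distributed support on Windows, added as a prototype in PyTorch 1.7, uses Gloo. So `backend='nccl'` may fail on this setup even with a complete apex build. Below is a sketch that picks a backend from what the installed PyTorch reports. `pick_backend` is a hypothetical helper. `torch.distributed.is_available`, `is_nccl_available` and `is_gloo_available` are PyTorch functions.

```python
def pick_backend(nccl_available: bool, gloo_available: bool) -> str:
    """Prefer NCCL when the build has it, otherwise fall back to Gloo."""
    if nccl_available:
        return "nccl"
    if gloo_available:
        return "gloo"
    raise RuntimeError("neither NCCL nor Gloo is available in this PyTorch build")


if __name__ == "__main__":
    try:
        import torch.distributed as dist
    except ImportError:
        dist = None
    if dist is None or not dist.is_available():
        print("torch.distributed is not available in this PyTorch build")
    else:
        backend = pick_backend(dist.is_nccl_available(), dist.is_gloo_available())
        print("Selected backend:", backend)
        # init_method="env://" reads MASTER_ADDR and MASTER_PORT from the
        # environment (and RANK / WORLD_SIZE unless passed as arguments):
        # dist.init_process_group(backend=backend, init_method="env://",
        #                         world_size=world_size, rank=rank)
```

Gloo is slower than NCCL for GPU training. On Windows, though, it may be the only backend available, which would avoid commenting out the sampler and epoch code.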

2)

For other installation methods, see codebrid's post “apex 安装/使用 记录” (apex installation/usage notes).

For testing, see the same post, “apex 安装/使用 记录”.
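A quick local test is to check whether the compiled extension modules can be found at all. A Python-only build like the one above lacks them, which matches the missing “amp_C” warning. Below is a sketch that uses `importlib.util.find_spec`, which locates a module without importing it. The module list is a partial selection of what apex's setup.py builds with `--cpp_ext` (`apex_C`) and `--cuda_ext` (`amp_C`, `syncbn`, `fused_layer_norm_cuda`, among others). It may differ between apex versions.

```python
import importlib.util
from typing import Dict, Iterable

# Partial, illustrative list of modules apex compiles only with --cpp_ext / --cuda_ext.
APEX_EXTENSIONS = ("apex_C", "amp_C", "syncbn", "fused_layer_norm_cuda")


def extension_status(names: Iterable[str] = APEX_EXTENSIONS) -> Dict[str, bool]:
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = extension_status()
    for name, found in status.items():
        print(f"{name:24} {'found' if found else 'MISSING'}")
    if not all(status.values()):
        print("This looks like a Python-only apex build; fused kernels are unavailable.")
```

Run it in the same conda environment that runs the training code. If every entry is `MISSING`, apex was installed without the extensions.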