A Bumpy Road to Deep Learning (Part 1): Environment Setup

xiaoxiao2021-02-28 51

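Hi everyone. I'm new here, and this is the first post in my first technical series. I hope to make some progress on the road to deep learning, and I'll try to explain things as clearly as I can. Experts, please criticize freely; that is how I'll improve. Thanks in advance for your support. (Also, if the layout looks bad, please say so in the comments and I'll gradually improve my formatting habits.)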

My rough study plan: Python3.5, tensorflow, blocksparse (my boss at my internship introduced me to this one), and so on. I'll fill in the "and so on" later as problems come up.

After several attempts, let me first spell out my environment:

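python3.5+: it has to be python3.5+. The blocksparse GitHub page says 2.7+, but blocksparse's own code does not work with that, so 3.5+ is required.
tensorflow-gpu-1.4.1: I ran into many problems here, which I cover below.
blocksparse-1.0.0: a framework from OpenAI. Very few people seem to have used it, so even Baidu turns up nothing for the name.
nvidia: the latest graphics driver.
cuda8.0: I wanted to use 9.1 but hit too many problems, probably because my skills are not up to it yet.
cudnn6.0: I will update to a newer version later.
ubuntu16.04

My server has an Nvidia Tesla M40 card, but any CUDA-capable card will do.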
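Before installing anything, it can help to confirm that a machine matches the list above. The following is only a minimal sketch using the Python standard library: the function names (`python_ok`, `find_cudnn_libs`, `report`) are made up for illustration, and the default path `/usr/local/cuda` is an assumption based on the install location used in this post.

```python
import glob
import os
import sys

MIN_PYTHON = (3, 5)  # blocksparse needs 3.5+, per the environment list above


def python_ok(version_info=sys.version_info):
    """Return True if the given interpreter version is at least 3.5."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def find_cudnn_libs(cuda_root="/usr/local/cuda"):
    """List libcudnn files under <cuda_root>/lib64 (empty list if none)."""
    pattern = os.path.join(cuda_root, "lib64", "libcudnn*")
    return sorted(glob.glob(pattern))


def report(cuda_root="/usr/local/cuda"):
    """Build three human-readable status lines for the local machine."""
    libs = find_cudnn_libs(cuda_root)
    return [
        "python >= 3.5: {}".format(python_ok()),
        "cuda dir exists: {}".format(os.path.isdir(cuda_root)),
        "cudnn libs: {}".format(", ".join(libs) if libs else "none found"),
    ]


if __name__ == "__main__":
    for line in report():
        print(line)
```

The sketch uses `str.format` rather than f-strings on purpose, since f-strings need Python 3.6 and the target here is 3.5. It only checks that files exist; it does not verify that the driver, CUDA and cuDNN versions are compatible with each other.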

Now to the main topic: setting up the environment with pip3:

python3.5+: not much to say here, apt-get is enough.

    apt-get install python3 libpython3-dev
    #If packages are missing, follow the prompts; different people may hit different problems

pip3: many people simply use

    sudo apt-get install python3-pip

That works, but the packaged version is rather old, so I recommend this instead:

    wget https://bootstrap.pypa.io/get-pip.py
    python3 get-pip.py

nvidia-driver: download the latest driver from the official site and install it.

cuda toolkit: download the CUDA toolkit version you need and install it wherever you like. I used the default /usr/local/cuda.

cudnn: download the cuDNN version that matches your CUDA version. Ignore the odd file extension "solitairetheme8" and extract the archive directly with tar. Then copy libcudnn.* into lib64 under your CUDA install directory (mine is /usr/local/cuda/), create the symlinks, and remember to run ldconfig at the end.

    tar xvf cudnn-8.0-linux-x64-v6.0.solitairetheme8
    cp cuda/lib64/libcudnn.* /usr/local/cuda/lib64
    cp cuda/include/cudnn.h /usr/local/cuda/include
    cd /usr/local/cuda/lib64
    #Recreate the symlinks; use the full version file name that was actually extracted
    rm libcudnn.so.6.0 libcudnn.so.6
    ln -s libcudnn.so.6.0.53 libcudnn.so.6.0
    ln -s libcudnn.so.6.0 libcudnn.so.6
    ldconfig

tensorflow-gpu: install it directly with pip3.

    pip3 install tensorflow-gpu
    #To pin the version listed in my environment: pip3 install tensorflow-gpu==1.4.1

blocksparse: also installed directly with pip3.

    pip3 install blocksparse

At this point the environment is basically ready. If you run into other problems, leave a comment and we can put together solutions for them.

Now to the main topic again: installing from source:

nvidia-driver, cuda, cudnn: these must come from the official installers, exactly as described above, so I won't repeat the steps.

Download the tensorflow source first.

    #Clone the source from github
    git clone --recurse-submodules https://github.com/tensorflow/tensorflow
    #--recurse-submodules is required; it also clones the submodules the build depends on

Download and build bazel.

    #bazel is required, but installing it is a real pain. The methods you find on Baidu
    #are all the same and none of them solved my problem, so here is what worked for me.
    #First get the source, in one of two ways.
    #Method 1: clone from github
    git clone https://github.com/bazelbuild/bazel.git
    #This is extremely slow, so the alternative is to download a release source archive, which is faster
    wget https://github.com/bazelbuild/bazel/archive/0.9.0.tar.gz
    tar xvf 0.9.0.tar.gz
    #The official instructions say to build with ./compile.sh, but that always failed for me
    #(possibly a problem with my environment). I used an already installed bazel instead:
    cd bazel-0.9.0
    bazel build //src:bazel
    #The freshly built bazel is now at ./bazel-bin/src/bazel; copy it to /usr/bin
    mv /usr/bin/bazel /usr/bin/bazel.bak
    cp ./bazel-bin/src/bazel /usr/bin/

Install the required dependencies.

    #Mainly numpy and scipy (scipy will be needed by blocksparse later), installed with pip3
    pip3 install numpy
    pip3 install scipy
    #If other libraries are reported missing, install them with pip3 or sudo apt-get install xxx

Build tensorflow with GPU support enabled.

    #Go to the tensorflow directory first
    cd ~/tensorflow
    ./configure
    #A series of questions follows: answer Y for what you need, N for what you don't,
    #and just press Enter when you are not sure

    Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python3.5
    Found possible Python library paths:
      /usr/local/lib/python3.5/dist-packages
      /usr/lib/python3/dist-packages
    Please input the desired Python library path to use. Default is [/usr/local/lib/python3.5/dist-packages] #press Enter
    Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
    No jemalloc as malloc support will be enabled for TensorFlow.
    Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
    No Google Cloud Platform support will be enabled for TensorFlow.
    Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
    No Hadoop File System support will be enabled for TensorFlow.
    Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
    No Amazon S3 File System support will be enabled for TensorFlow.
    Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
    No XLA JIT support will be enabled for TensorFlow.
    Do you wish to build TensorFlow with GDR support? [y/N]: n
    No GDR support will be enabled for TensorFlow.
    Do you wish to build TensorFlow with VERBS support? [y/N]: n
    No VERBS support will be enabled for TensorFlow.
    Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
    No OpenCL SYCL support will be enabled for TensorFlow.
    Do you wish to build TensorFlow with CUDA support? [y/N]: y #we want cuda, so this must be y
    CUDA support will be enabled for TensorFlow.
    Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 8.0
    Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
    Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:
    Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
    Please specify a list of comma-separated Cuda compute capabilities you want to build with.
    You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
    Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]5.2
    Do you want to use clang as CUDA compiler? [y/N]:
    nvcc will be used as CUDA compiler.
    Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
    Do you wish to build TensorFlow with MPI support? [y/N]:
    No MPI support will be enabled for TensorFlow.
    Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
    Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
    Not configuring the WORKSPACE for Android builds.
    Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
      --config=mkl         # Build with MKL support.
      --config=monolithic  # Config for mostly static monolithic build.
    Configuration finished

    #Configuration is done; now compile tf
    bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
    #After this command my server reported a CPU instruction set problem, so I used the
    #following instead. If you hit the same thing, run this:
    bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --config=cuda //tensorflow/cc:tutorials_example_trainer
    #This is a test program shipped with tf that computes a matrix eigenvalue; it prints a lot of output
    bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
    #Next, build the pip package and install it (for the instruction set problem, add the --copt flags shown above)
    bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
    bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    pip3 install /tmp/tensorflow_pkg/tensorflow-1.4.1-cp35-none-linux_x86_64.whl

With that, my environment is set up. There are certainly all kinds of other strange errors out there. This post covers many of the problems I ran into and how I got past them, and I'd welcome corrections from anyone more experienced.

Discussion

Why does it have to be py35+ when the official page says 2.7 or later is fine?
Because files in blocksparse such as ewops contain 3.x syntax, so 2.x no longer works. It took me a long time to track down the cause.

Not fair, why didn't you explain how to compile blocksparse?
Because I could not get it to compile. The git repository seems to be incomplete: make compile always complains that a header file is empty, and that header file really is empty. The repository is at https://github.com/openai/blocksparse.git. If anyone has compiled it successfully, please share how.

The environment is ready. Next time I'll start writing my own programs with this framework. There is a long way to go.
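A quick sanity check after either install route (pip3 or the wheel built from source) is to ask TensorFlow which devices it can see. This is a hedged sketch, not part of the original procedure: `list_gpu_devices` and `summarize` are hypothetical helper names, and the only TensorFlow call used is `device_lib.list_local_devices()` from `tensorflow.python.client`. The import is wrapped so the script still runs, and says so, when TensorFlow is not installed.

```python
def list_gpu_devices():
    """Return the GPU device names TensorFlow sees, or None if it cannot be imported."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    # Each entry has a device_type such as "CPU" or "GPU".
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


def summarize(devices):
    """Turn the result of list_gpu_devices() into a one-line message."""
    if devices is None:
        return "tensorflow is not importable"
    if not devices:
        return "tensorflow imported, but no GPU device is visible"
    return "GPU devices: " + ", ".join(devices)


if __name__ == "__main__":
    print(summarize(list_gpu_devices()))
```

If the second message appears on a machine that does have a CUDA card, the usual suspects are the ones discussed in this post: a CPU-only build, or CUDA/cuDNN libraries that the dynamic loader cannot find.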

When reposting, please cite the original address: https://www.6miu.com/read-2300344.html
