Detailed tutorial: configuring a deep learning TensorFlow/Keras environment for Kaggle competitions on a Linux server
This article was first published on my personal blog https://kezunlin.me/post/6b505d27/ ; you are welcome to read the latest version there!
A full guide to installing and configuring deep learning environments on a Linux server.
Quick Guide
prepare
tools
MobaXterm (for windows)
ssh + vscode
for Windows:
drop files onto MobaXterm to upload them to the server; use the zip format for archives.
commands
view disk
du -d 1 -h
df -h
gpu and cpu usage
watch -n 1 nvidia-smi
top
view files and count
wc -l data.csv

# count how many folders
ls -lR | grep '^d' | wc -l
17

# count how many jpg files
ls -lR | grep '.jpg' | wc -l
1360

# view 10 images
ls train | head
ls test | head
link datasets
# link: ln -s src dest
ln -s /data_1/kezunlin/datasets/ dl4cv/datasets
scp
scp -r node17:~/dl4cv ~/git/
scp -r node17:~/.keras ~/
tmux for background tasks
tmux new -s notebook
tmux ls
tmux attach -t notebook
tmux detach
wget download
# continue a download
wget -c url

# background download for large files
wget -b -c url
tail -f wget-log

# kill background wget
pkill -9 wget
tips about training large models
terminal 1:
tmux new -s train
conda activate keras
time python train_alexnet.py
terminal 2:
tmux detach
tmux attach -t train
Then we can safely close vscode; without tmux, the training process would exit as soon as vscode (and its shell) is closed.
cuda driver and toolkits
See cuda-toolkit for the matching CUDA driver version; the cudatoolkit version depends on the CUDA driver version.
install nvidia-drivers
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-cache search nvidia-*
# nvidia-384
# nvidia-396
sudo apt-get -y install nvidia-418

# test
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
reboot to test again
https://stackoverflow.com/que…
install cuda-toolkit (drivers)
remove all previous nvidia drivers
sudo apt-get -y purge nvidia-*
Go here and download cuda_10.1.
wget -b -c http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run

sudo sh cuda_10.1.243_418.87.00_linux.run
# or
sudo ./cuda_10.1.243_418.87.00_linux.run

vim .bashrc
# for cuda and cudnn
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
check cuda driver version
> cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.87.00  Thu Aug  8 15:35:46 CDT 2019
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)

> nvidia-smi
Tue Aug 27 17:36:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+

> nvidia-smi -L
GPU 0: Quadro RTX 8000 (UUID: GPU-acb01c1b-776d-cafb-ea35-430b3580d123)
GPU 1: Quadro RTX 8000 (UUID: GPU-df7f0fb8-1541-c9ce-e0f8-e92bccabf0ef)
GPU 2: Quadro RTX 8000 (UUID: GPU-67024023-20fd-a522-dcda-261063332731)
GPU 3: Quadro RTX 8000 (UUID: GPU-7f9d6a27-01ec-4ae5-0370-f0c356327913)

> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
install conda
./Anaconda3-2019.03-Linux-x86_64.sh
[yes]
[yes]
config channels
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --set show_channel_urls yes
install libraries
conclusions:
py37/keras: conda install -y tensorflow-gpu keras==2.2.5
py37/torch: conda install -y pytorch torchvision
py36/mxnet: conda install -y mxnet
keras 2.2.5 was released on 2019/8/23.
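To confirm which versions were actually installed in the active env, a quick check from Python (expected values taken from the package listings below):

import tensorflow as tf
import keras

print(tf.__version__)     # e.g. 1.13.1 (py37) or 1.14.0 (py36)
print(keras.__version__)  # e.g. 2.2.5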
common libraries
conda install -y scikit-learn scikit-image pandas matplotlib pillow opencv seaborn
pip install imutils progressbar pydot pylint
Use pip install imutils (rather than conda) to avoid conda downgrading tensorflow-gpu.
py37
cudatoolkit      10.0.130   0
cudnn            7.6.0      cuda10.0_0
tensorflow-gpu   1.13.1
py36
cudatoolkit         anaconda/pkgs/main/linux-64::cudatoolkit-10.1.168-0
cudnn               anaconda/pkgs/main/linux-64::cudnn-7.6.0-cuda10.1_0
tensorboard         anaconda/pkgs/main/linux-64::tensorboard-1.14.0-py36hf484d3e_0
tensorflow          anaconda/pkgs/main/linux-64::tensorflow-1.14.0-gpu_py36h3fb9ad6_0
tensorflow-base     anaconda/pkgs/main/linux-64::tensorflow-base-1.14.0-gpu_py36he45bfe2_0
tensorflow-estima~  anaconda/cloud/conda-forge/linux-64::tensorflow-estimator-1.14.0-py36h5ca1d4c_0
tensorflow-gpu      anaconda/pkgs/main/linux-64::tensorflow-gpu-1.14.0-h0d30ee6_0
imutils only supports py36 and py37.
details
# remove py35
conda remove -n py35 --all
conda info --envs

conda create -n py37 python==3.7
conda activate py37

# common libraries
conda install -y scikit-learn pandas pillow opencv
pip install imutils

# imutils
conda search imutils  # py36 and py37
# Name    Version  Build   Channel
imutils   0.5.2    py27_0  anaconda/cloud/conda-forge
imutils   0.5.2    py36_0  anaconda/cloud/conda-forge
imutils   0.5.2    py37_0  anaconda/cloud/conda-forge

# tensorflow-gpu and keras
conda install -y tensorflow-gpu keras

# install pytorch
conda install -y pytorch torchvision

# install mxnet
# method 1: pip
pip search mxnet
mxnet-cu80[mkl]/mxnet-cu90[mkl]/mxnet-cu91[mkl]/mxnet-cu92[mkl]/mxnet-cu100[mkl]/mxnet-cu101[mkl]
# method 2: conda
conda install mxnet  # py35 and py36
TensorFlow Object Detection API
home page: https://github.com/tensorflow/models (research/object_detection)
Download the tensorflow models repository and rename models-master to tfmodels.
vim ~/.bashrc
export PYTHONPATH=/home/kezunlin/dl4cv:/data_1/kezunlin/tfmodels/research:$PYTHONPATH
source ~/.bashrc
jupyter notebook
conda activate py37
conda install -y jupyter
install kernels
python -m ipykernel install --user --name=py37
Installed kernelspec py37 in /home/kezunlin/.local/share/jupyter/kernels/py37
config for server
python -c "import IPython;print(IPython.lib.passwd())"
Enter password:
Verify password:
sha1:ef2fb2aacff2:4ea2998699638e58d10d594664bd87f9c3381c04

jupyter notebook --generate-config
Writing default config to: /home/kezunlin/.jupyter/jupyter_notebook_config.py

vim .jupyter/jupyter_notebook_config.py

c.NotebookApp.ip = '*'
c.NotebookApp.password = u'sha1:xxx:xxx'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
c.NotebookApp.enable_mathjax = True
run jupyter in the background
tmux new -s notebook
jupyter notebook

# ctrl+b d : detach from the session and DO NOT close it
# ctrl+d   : exit and close the session
Access the notebook in a browser (port 8888 as configured above) and input the password.
test
py37
import cv2
cv2.__version__

import tensorflow as tf
import keras

import torch
import torchvision
cat .keras/keras.json
{ "epsilon": 1e-07, "floatx": "float32", "backend": "tensorflow", "image_data_format": "channels_last" }
py36
import mxnet
train demo
export
# use CPU only
export CUDA_VISIBLE_DEVICES=""

# use gpu 0 and 1
export CUDA_VISIBLE_DEVICES="0,1"
code
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"
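To double-check which devices TensorFlow actually sees after setting the variable, a small sketch using the TF 1.x utilities installed above:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"   # must be set before importing tensorflow

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())                         # True if at least one GPU is usable
print([d.name for d in device_lib.list_local_devices()])  # e.g. ['/device:CPU:0', '/device:GPU:0', '/device:GPU:1']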
start training
python train.py
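train.py here stands for the project's own script (train_alexnet.py earlier). Purely as a placeholder, a minimal hypothetical example of such a script, a small Keras CNN on cifar10, not the actual training code:

# minimal_train.py -- hypothetical stand-in for train.py
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"   # pick GPUs before importing tensorflow/keras

from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

# load and normalize data (downloads to ~/.keras/datasets on first run)
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# small CNN
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=64, epochs=10,
          validation_data=(x_test, y_test))
model.save('model.h5')   # requires h5py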
.keras folder
view keras models and datasets
ls .keras/
datasets  keras.json  models
models saved to /home/kezunlin/.keras/models/
datasets saved to /home/kezunlin/.keras/datasets/
model list
xxx_kernels_notop.h5: for include_top = False
xxx_kernels.h5: for include_top = True
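For example, loading a pre-trained network from keras.applications shows which of the two files gets cached under ~/.keras/models/:

from keras.applications import VGG16

# caches vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# caches vgg16_weights_tf_dim_ordering_tf_kernels.h5
full = VGG16(weights='imagenet', include_top=True)

print(base.output_shape)  # (None, 7, 7, 512)
print(full.output_shape)  # (None, 1000)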
Datasets
mnist
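Loaded the same way as cifar10; the file is cached as ~/.keras/datasets/mnist.npz on the first call:

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)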
cifar10
to skip download
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
mv ~/Download/cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
to load data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
flowers-17
animals
panda images are WRONG !!!
counts
ls -lR animals/cat | grep ".jpg" | wc -l
1000
ls -lR animals/dog | grep ".jpg" | wc -l
1000
ls -lR animals/panda | grep ".jpg" | wc -l
1000
kaggle cats vs dogs
caltech101
download in the background
wget -b -c http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz
Kaggle API
install and config
see kaggle-api
conda activate keras
conda install kaggle

# download kaggle.json
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

cat kaggle.json
{"username":"xxx","key":"yyy"}
or by export
export KAGGLE_USERNAME=xxx
export KAGGLE_KEY=yyy
tips
- Go to your Kaggle account page, select 'Create API Token', and kaggle.json will be downloaded.
- Ensure kaggle.json is in the location ~/.kaggle/kaggle.json to use the API.
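The same credentials can also be used from Python via the kaggle package's API client; a short sketch (assuming the KaggleApi client shipped with this kaggle version, and a hypothetical download path), with the CLI commands below as the main path:

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()   # reads ~/.kaggle/kaggle.json or KAGGLE_USERNAME/KAGGLE_KEY
api.competition_download_files('dogs-vs-cats', path='datasets/dogs-vs-cats')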
check version
kaggle --version
Kaggle API 1.5.5
commands overview
commands
kaggle competitions {list, files, download, submit, submissions, leaderboard}
kaggle datasets {list, files, download, create, version, init}
kaggle kernels {list, init, push, pull, output, status}
kaggle config {view, set, unset}
download datasets
kaggle competitions download -c dogs-vs-cats
show leaderboard
kaggle competitions leaderboard dogs-vs-cats --show

 teamId  teamName              submissionDate       score
-------  --------------------  -------------------  -------
  71046  Pierre Sermanet       2014-02-01 21:43:19  0.98533
  66623  Maxim Milakov         2014-02-01 18:20:58  0.98293
  72059  Owen                  2014-02-01 17:04:40  0.97973
  74563  Paul Covington        2014-02-01 23:05:20  0.97946
  74298  we've been in KAIST   2014-02-01 21:15:30  0.97840
  71949  orchid                2014-02-01 23:52:30  0.97733
set default competition
kaggle config set --name competition --value dogs-vs-cats
- competition is now set to: dogs-vs-cats

kaggle config set --name competition --value dogs-vs-cats-redux-kernels-edition
dogs-vs-cats
submit
kaggle c submissions
- Using competition: dogs-vs-cats
- No submissions found

kaggle c submit -f ./submission.csv -m "first submit"
The dogs-vs-cats competition has already ended, so we cannot submit.
Nvidia-docker and containers
install
sudo apt-get -y install docker

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
restart (optional)
cat /etc/docker/daemon.json
{ "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } } }
sudo systemctl enable docker
sudo systemctl start docker
if errors occur:
Job for docker.service failed because the control process exited with error code.
See “systemctl status docker.service” and “journalctl -xe” for details.
check /etc/docker/daemon.json
test
sudo docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
sudo nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi

Thu Aug 29 00:11:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:02:00.0 Off |                  Off |
| 43%   67C    P2   136W / 260W |  46629MiB / 48571MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     Off  | 00000000:03:00.0 Off |                  Off |
| 34%   54C    P0    74W / 260W |      0MiB / 48571MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     Off  | 00000000:82:00.0 Off |                  Off |
| 34%   49C    P0    73W / 260W |      0MiB / 48571MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 8000     Off  | 00000000:83:00.0 Off |                  Off |
| 33%   50C    P0    73W / 260W |      0MiB / 48571MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Add the user to the docker group so sudo is no longer needed for docker commands.
command refs
sudo nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi
sudo nvidia-docker run -t -i --privileged nvidia/cuda bash
sudo docker run -it --name kzl -v /home/kezunlin/workspace/:/home/kezunlin/workspace nvidia/cuda
Reference
History
20190821: created.
Copyright
Post author: kezunlin
Post link: https://kezunlin.me/post/6b505d27/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 3.0 unless stated otherwise.