azure-bigcompute

:star: :penguin: :new: GPU SKU usage for Ubuntu 16.04-LTS and CentOS 7.4, plus standard open source scheduler deployments (Torque, SLURM, PBS Pro) for HPC SKUs on CentOS 7.4-HPC with OMS. This presently targets the GAed CentOS-HPC A9/H16R/H16MR and GPU NC6/NC12/NC24 SKUs. The latest Docker CE and nvidia-docker are present in all deployments. Updating DIGITS for the nvidia 2.0 runtime.




Deploy from Portal and visualize

Deploy to Azure


For portal deployment, the following picture might assist.

(screenshot: azureportaldeploy)

This project is hosted at https://github.com/cloudgear-io/azure-bigcompute.

For the latest version, to contribute, and for more information, please go through this README.md.

To clone the current master (development) branch run:

git clone git://github.com/cloudgear-io/azure-bigcompute.git

Single or Cluster Topology Examples with Azure CLI

New Azure CLI

	docker run -dti --restart=always --name=azure-cli-python azuresdk/azure-cli-python && docker exec -ti azure-cli-python bash -c "az login && bash"

az login will then prompt: To sign in, use a web browser to open the page https://aka.ms/devicelogin and enter the code XXXXXXXXX to authenticate.
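Once the device login completes, the session can be verified from inside the container; a minimal check:

	az account show --output table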

HPC with RDMA over IB

GPU Computes

Ubuntu 16.04-LTS

It is always great to build a Linux secure shell (SSH) jumpbox. Having a centralized location that can be used to quickly “jump” to any cluster saves a great deal of time. It also opens up opportunities for speeding up repetitive chores during testing and deployment, especially in a cloud-only environment.

This repository can be used for creating Linux jumpboxes, preferably Ubuntu 16.04-LTS or CentOS 7.3, as per the distro of choice.

For Linux, it is always a good idea to review the Azure virtual machine sizes you can use to run your Linux apps and workloads.

One can also create production-grade clusters by replacing single with cluster in the singleOrCluster template parameter.

az group create -l westeurope -n centospublicwe && az group deployment create -g centospublicwe -n centospublicwe --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"centospublic\"},\"AdminUserName\":{\"value\":\"azureuser\"},\"SshPublicKey\":{\"value\":\"XXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_F2s\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"azure\"},\"MasterVMName\":{\"value\":\"centos73\"},\"NumDataDisks\":{\"value\":\"2\"}}" --debug
az group create -l westeurope -n ubuntupublicwe && az group deployment create -g ubuntupublicwe -n ubuntupublicwe --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"ubuntupublic\"},\"AdminUserName\":{\"value\":\"azureuser\"},\"SshPublicKey\":{\"value\":\"XXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_F2s\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"azure\"},\"MasterVMName\":{\"value\":\"ubuntu1604\"},\"NumDataDisks\":{\"value\":\"2\"}}" --debug
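To confirm a deployment finished and to find the public IP to SSH into, something like the following should work (resource group names as used above):

	az group deployment show -g centospublicwe -n centospublicwe --query properties.provisioningState
	az network public-ip list -g centospublicwe --output table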

GPUs for Compute

Azure GPUs

Try CUDA Samples and GROMACS

	yum/apt-get install -y cmake

Then,

	cd /opt && \
	export GROMACS_DOWNLOAD_SUM=e9e3a41bd123b52fbcc6b32d09f8202b && export GROMACS_PKG_VERSION=2016.3 && curl -o gromacs-$GROMACS_PKG_VERSION.tar.gz -fsSL http://ftp.gromacs.org/pub/gromacs/gromacs-$GROMACS_PKG_VERSION.tar.gz && \
	echo "$GROMACS_DOWNLOAD_SUM  gromacs-$GROMACS_PKG_VERSION.tar.gz" | md5sum -c --strict - && \
	tar xfz gromacs-$GROMACS_PKG_VERSION.tar.gz && \
	cd gromacs-$GROMACS_PKG_VERSION && \
	mkdir build-gromacs && \
	cd build-gromacs && \
	cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0 && \
	make && \
	make install && \
	export PATH=/usr/local/gromacs/bin:$PATH

After the above, gmx will be available. For further reference please visit the latest GROMACS manual.
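A quick way to confirm the build, assuming the default /usr/local/gromacs install prefix used above:

	source /usr/local/gromacs/bin/GMXRC
	gmx --version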

Unattended NVIDIA Tesla Driver Silent Install without further reboot during provisioning via this repo

The NVIDIA Tesla driver is silently installed, without a further reboot, via azuredeploy.sh in this repository for cluster or single node deployments, as follows:

:grey_exclamation:

Currently, this is not required when using the secure cuda-repo-ubuntu1604_8.0.61-1_amd64.deb for Azure NC VMs running Ubuntu Server 16.04 LTS.

This is required for the NVIDIA driver with DKMS (Dynamic Kernel Module Support), so that the driver module survives kernel updates.

Ubuntu 16.04-LTS

	service lightdm stop 
	wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.39/NVIDIA-Linux-x86_64-375.39.run
	apt-get install -y linux-image-virtual
	apt-get install -y linux-virtual-lts-xenial
	apt-get install -y linux-tools-virtual-lts-xenial linux-cloud-tools-virtual-lts-xenial
	apt-get install -y linux-tools-virtual linux-cloud-tools-virtual
	DEBIAN_FRONTEND=noninteractive apt-mark hold walinuxagent
	DEBIAN_FRONTEND=noninteractive apt-get update -y
	DEBIAN_FRONTEND=noninteractive apt-get install -y build-essential gcc gcc-multilib dkms g++ make binutils linux-headers-`uname -r` linux-headers-4.4.0-70-generic
	chmod +x NVIDIA-Linux-x86_64-375.39.run
	./NVIDIA-Linux-x86_64-375.39.run  --silent --dkms
	DEBIAN_FRONTEND=noninteractive update-initramfs -u
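Once the installer finishes, the DKMS registration and GPU visibility can be verified with:

	dkms status
	nvidia-smi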

CentOS 7.3

	wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.39/NVIDIA-Linux-x86_64-375.39.run
	yum clean all
	yum update -y  dkms
	yum install -y gcc make binutils gcc-c++ kernel-devel kernel-headers --disableexcludes=all
	yum -y upgrade kernel kernel-devel
	chmod +x NVIDIA-Linux-x86_64-375.39.run
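	# Defer the actual driver install to the next boot (after the kernel upgrade above):
	# rc.local runs install_nvidiarun.sh once, and that script then removes its own
	# rc.local entry and deletes itself.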
	cat >>~/install_nvidiarun.sh <<EOF
	cd /var/lib/waagent/custom-script/download/0 && \
	./NVIDIA-Linux-x86_64-375.39.run --silent --dkms --install-libglvnd && \
	sed -i '$ d' /etc/rc.d/rc.local && \
	chmod -x /etc/rc.d/rc.local
	rm -rf ~/install_nvidiarun.sh
	EOF
	chmod +x install_nvidiarun.sh
	echo -ne "~/install_nvidiarun.sh" >> /etc/rc.d/rc.local
	chmod +x /etc/rc.d/rc.local

Installation of NVIDIA CUDA Toolkit during provisioning via this repo

Silent and secure installation of the NVIDIA CUDA Toolkit via azuredeploy.sh in this repository for cluster or single node deployments.

Ubuntu 16.04-LTS

	CUDA_REPO_PKG=cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
	DEBIAN_FRONTEND=noninteractive apt-mark hold walinuxagent
	export CUDA_DOWNLOAD_SUM=1f4dffe1f79061827c807e0266568731 && export CUDA_PKG_VERSION=8-0 && curl -o cuda-repo.deb -fsSL http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG} && \
	    echo "$CUDA_DOWNLOAD_SUM  cuda-repo.deb" | md5sum -c --strict - && \
	    dpkg -i cuda-repo.deb && \
	    rm cuda-repo.deb && \
	    apt-get update -y && apt-get install -y cuda && \
	    apt-get install -y nvidia-cuda-toolkit && \
	export LIBRARY_PATH=/usr/local/cuda-8.0/lib64/:${LIBRARY_PATH} && export LIBRARY_PATH=/usr/local/cuda-8.0/lib64/stubs:${LIBRARY_PATH} && \
	export PATH=/usr/local/cuda-8.0/bin:${PATH}
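A quick sanity check after installation, assuming the exports above are in effect:

	nvcc --version
	nvidia-smi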

CentOS 7.3

	wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm
	rpm -i cuda-repo-rhel7-8.0.61-1.x86_64.rpm
	yum clean all
	yum install -y cuda
CUDA Samples Install
Ubuntu 16.04-LTS

CUDA Samples are installed via azuredeploy.sh in this repository, for cluster or single node, in the parameterized RAID0 location as follows for Ubuntu:

 export SHARE_DATA="/data/data"
 export SAMPLES_USER="gpuuser"
 su -c "/usr/local/cuda-8.0/bin/./cuda-install-samples-8.0.sh $SHARE_DATA" $SAMPLES_USER
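A minimal smoke test of the installed samples (NVIDIA_CUDA-8.0_Samples is the directory created by the install script):

	cd $SHARE_DATA/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
	make
	./deviceQuery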

CentOS 7.3

On CentOS 7.3, the samples are installed in /usr/local/cuda-8.0/samples.

Secure installation of CUDNN during provisioning via this repo

Both Ubuntu 16.04-LTS and CentOS 7.3

The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations of standard routines such as forward and backward convolution, pooling, normalization, and activation layers. cuDNN is part of the NVIDIA Deep Learning SDK and is installed silently as follows via azuredeploy.sh in this repository, for cluster or single node.

    export CUDNN_DOWNLOAD_SUM=a87cb2df2e5e7cc0a05e266734e679ee1a2fadad6f06af82a76ed81a23b102c8 && curl -fsSL http://developer.download.nvidia.com/compute/redist/cudnn/v5.1/cudnn-8.0-linux-x64-v5.1.tgz -O && \
    echo "$CUDNN_DOWNLOAD_SUM  cudnn-8.0-linux-x64-v5.1.tgz" | sha256sum -c --strict - && \
    tar -xzf cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local && \
    rm cudnn-8.0-linux-x64-v5.1.tgz && \
    ldconfig
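To confirm what was unpacked (the tarball extracts into /usr/local/cuda):

	ls /usr/local/cuda/lib64/libcudnn*
	grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h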

nvidia-docker usage

A version-parameterized nvidia-docker binary installation is automated for both Ubuntu 16.04-LTS and CentOS 7.3.

The latest nvidia 2.0 runtime for docker is available and auto-running post cluster provisioning.

DIGITS with docker runtime nvidia 2 and tensorboard within DIGITS

The latest DIGITS is available at

http://<Cluster_Public_IP>:5000

and TensorBoard at

http://<Cluster_Public_IP>:6006
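Should the containers ever need to be relaunched manually, a sketch assuming the nvidia/digits image from Docker Hub and the nvidia 2.0 runtime (ports as above; exact flags may differ per deployment):

	docker run --runtime=nvidia -d -p 5000:5000 -p 6006:6006 nvidia/digits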
Notes on nvidia-docker usage

Besides the latest installation of the NVIDIA CUDA Toolkit during provisioning via this repo:

nvidia-docker can be leveraged for dockerized CUDA Toolkit usage as per the test and picture below. This opens up possibilities of using “py” and “gpu” tagged images of CNTK, TensorFlow, Theano and more, available as nightly builds from Docker Hub, with Jupyter notebooks. The latest gitlab.com/nvidia cuDNN RCs can be used for testing.
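For example, a GPU-tagged TensorFlow image with a Jupyter notebook could be started like this (the image tag is an assumption; check Docker Hub for current tags):

	nvidia-docker run --rm -ti -p 8888:8888 tensorflow/tensorflow:latest-gpu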

sudo systemctl start nvidia-docker

nvidia-docker run --rm nvidia/cuda nvidia-smi

(screenshot: nvidiadocker — nvidia-smi running in a container)

More information is available at https://github.com/NVIDIA/nvidia-docker/wiki

License Agreements

By provisioning via this repository, you agree to the terms of the license agreements for NVIDIA software installed silently.

CUDA Toolkit

To view the license for the CUDA Toolkit, click here.

CUDA Deep Neural Network library (cuDNN)

To view the license for cuDNN, click here.

H-Series and A9 with schedulers

Details

mpirun

All paths are set automatically for the default provided users like azureuser/hpc. For root specifically, su - root is required.

source /opt/intel/impi/5.1.3.223/bin64/mpivars.sh

mpirun -ppn 1 -n 2 -hosts headN,compn0 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 hostname (Cluster Check)

mpirun -hosts headN,compn0 -ppn <processes per node> -n <number of consecutive processes> -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong (base pingpong stats)
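For instance, a two-node pingpong run with one process per node, using the host names from the cluster check above:

	mpirun -hosts headN,compn0 -ppn 1 -n 2 -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong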

IB

ls /sys/class/infiniband

cat /sys/class/infiniband/mlx4_0/ports/1/state

/etc/init.d/opensmd start (if required)

cat /sys/class/infiniband/mlx4_0/ports/1/rate

Torque and pbspro for CentOS-HPC Skus

All compute nodes automatically run pbs_mom, and the head node runs both pbs_mom and pbs_server, for the latest Torque or PBS Pro built from source from their respective master repos at cluster provision time. No post-installation tasks are required after a successful cluster deployment, except if np is to be increased from 1.

Check whether Torque or PBS Pro is running via pbsnodes -a
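A quick scheduler smoke test that works on both Torque and PBS Pro:

	echo "sleep 30 && hostname" | qsub
	qstat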

PBS Pro License

All paths are set automatically for the default users like azureuser/hpc/root. For root specifically, su - root is required.

SLURM LICENSE AGREEMENT - GPL v2

To check SLURM info, run: sinfo -N -l

Since this is Intel MPI, the preferred usage is mpirun with sbatch.
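A minimal sbatch wrapper for an Intel MPI run (a sketch; node counts, fabric settings, and the mpivars.sh path follow the examples above):

	cat > imb.sbatch <<'EOF'
	#!/bin/bash
	#SBATCH --nodes=2
	#SBATCH --ntasks-per-node=1
	source /opt/intel/impi/5.1.3.223/bin64/mpivars.sh
	mpirun -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 IMB-MPI1 pingpong
	EOF
	sbatch imb.sbatch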

Optional usage with OMS

OMS setup is optional; the OMS Workspace Id and OMS Workspace Key can either be kept blank or populated after the steps below.

Create a free account for the Microsoft Azure Operations Management Suite with a workspaceName.

Reporting bugs

Please report bugs by opening an issue in the GitHub Issue Tracker

Patches and pull requests

Patches can be submitted as GitHub pull requests. If using GitHub please make sure your branch applies to the current master as a ‘fast forward’ merge (i.e. without creating a merge commit). Use the git rebase command to update your branch to the current master if necessary.

Region availability and Quotas for MS Azure Skus