:star: :penguin: :new: GPU SKU usage for Ubuntu 16.04-LTS and CentOS 7.4, plus standard open-source scheduler deployments (Torque, SLURM, PBSPro) for HPC SKUs on CentOS 7.4-HPC, with optional OMS. This presently covers the GA'd CentOS-HPC A9/H16R/H16MR and GPU NC6/NC12/NC24 sizes. The latest Docker CE and nvidia-docker are present in all deployments. DIGITS is being updated for the 2.0 nvidia runtime.
This repo is inspired by Christian Smith’s repo https://github.com/smith1511/hpc
For portal deployment, the following picture might assist.
This project is hosted at:
For the latest version, to contribute, and for more information, please go through this README.md.
To clone the current master (development) branch run:
git clone git://github.com/cloudgear-io/azure-bigcompute.git
docker run -dti --restart=always --name=azure-cli-python azuresdk/azure-cli-python && docker exec -ti azure-cli-python bash -c "az login && bash"
To sign in, use a web browser to open the page https://aka.ms/devicelogin and enter the code XXXXXXXXX to authenticate.
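Once the device login completes, a quick optional check that the CLI session is active before deploying (a minimal sketch; the subscription shown will vary):
az account show --output table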
HPC cluster (each node H16R) with PBSPro and no OMS, head login user “azurehpcuser” and internal user “hpcgpu” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"pbspro\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
HPC single H16R with PBSPro and no OMS, login user “azurehpcuser” and internal user “hpcgpu” (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"pbspro\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
HPC cluster (each node H16R) with PBSPro and OMS, head login user “azurehpcuser” and internal user “hpcgpu” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"pbspro\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
HPC single H16R with PBSPro and OMS, login user “azurehpcuser” and internal user “hpcgpu” (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"pbspro\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
HPC cluster (each node H16R) with Torque and no OMS, head login user “azurehpcuser” and internal user “hpcgpu” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"Torque\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
HPC single H16R with Torque and no OMS, login user “azurehpcuser” and internal user “hpcgpu” (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"Torque\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
HPC cluster (each node H16R) with Torque and OMS, head login user “azurehpcuser” and internal user “hpcgpu” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"Torque\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
HPC single H16R with Torque and OMS, login user “azurehpcuser” and internal user “hpcgpu” (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"Torque\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
HPC cluster (each node H16R) with Slurm and no OMS, head login user “azurehpcuser” and internal user “hpcgpu” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"slurm\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
HPC single H16R with Slurm and no OMS, login user “azurehpcuser” and internal user “hpcgpu” (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"slurm\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
HPC cluster (each node H16R) with Slurm and OMS, head login user “azurehpcuser” and internal user “hpcgpu” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"slurm\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 1},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
HPC single H16R with Slurm and OMS, login user “azurehpcuser” and internal user “hpcgpu” (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l southcentralus -n tsthpc && az group deployment create -g tsthpc -n tsthpc --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tsthpc\"},\"AdminUserName\":{\"value\":\"azurehpcuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS-HPC\"},\"ImageSku\":{\"value\":\"7.4\"},\"schedulerpbsORTorqueORslurm\":{\"value\":\"slurm\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_H16R\"},\"WorkerNodeCount\":{\"value\": 0},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
Ubuntu GPU cluster (each node NC24) with no scheduler and no OMS, head login user “azuregpuuser” and internal user “gpuclususer” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 1},\"GpuHpcUserName\":{\"value\":\"gpuclususer\"},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
Ubuntu single NC24 with no scheduler and no OMS, login user “azuregpuuser” and internal user “gpuuser” (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"gpuuser\"},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
Ubuntu GPU cluster (each node NC24) with no scheduler and OMS, head login user “azuregpuuser” and internal user “gpuclususer” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 1},\"GpuHpcUserName\":{\"value\":\"gpuclususer\"},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
Ubuntu single NC24 with no scheduler and OMS, login user “azuregpuuser” and internal user “gpuuser” (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"gpuuser\"},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
CentOS GPU cluster (each node NC24) with no scheduler and no OMS, head login user “azuregpuuser” and internal user “gpuclususer” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 1},\"GpuHpcUserName\":{\"value\":\"gpuclususer\"},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
CentOS single NC24 with no scheduler and no OMS, login user “azuregpuuser” and internal user “gpuuser” (provided the SshPublicKey value is supplied below):
bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"gpuuser\"},\"NumDataDisks\":{\"value\":\"32\"}}" --debug
CentOS GPU cluster (each node NC24) with no scheduler and OMS, head login user “azuregpuuser” and internal user “gpuclususer” - minimum 1 head and 1 worker (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"cluster\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 1},\"GpuHpcUserName\":{\"value\":\"gpuclususer\"},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
CentOS single NC24 with no scheduler and OMS, login user “azuregpuuser” and internal user “gpuuser” (provided the SshPublicKey value is supplied below along with oMSWorkSpaceId and oMSWorkSpaceKey):
bash-4.3# az group create -l eastus -n tstgpu4computes && az group deployment create -g tstgpu4computes -n tstgpu4computes --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"tstgpu4computes\"},\"AdminUserName\":{\"value\":\"azuregpuuser\"},\"SshPublicKey\":{\"value\":\"XXXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_NC24\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"gpuuser\"},\"NumDataDisks\":{\"value\":\"32\"},\"oMSWorkSpaceId\":{\"value\": \"xxxxxxxxxx\"},\"oMSWorkSpaceKey\":{\"value\": \"xxxxxxxxx\"}}" --debug
It is always useful to build a Linux secure shell (SSH) jumpbox. A centralized location from which you can quickly “jump” to any cluster saves a lot of time and speeds up repetitive chores during testing and deployment, especially in a cloud-only environment.
This repository can also be used to create Linux jumpboxes, preferably Ubuntu 16.04-LTS or CentOS 7.3, depending on the distro of choice.
For Linux in general, it is worth reviewing the Azure virtual machine sizes available to run your Linux apps and workloads.
Excellent clusters can likewise be created by replacing single with cluster in the template parameters.
az group create -l westeurope -n centospublicwe && az group deployment create -g centospublicwe -n centospublicwe --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"centospublic\"},\"AdminUserName\":{\"value\":\"azureuser\"},\"SshPublicKey\":{\"value\":\"XXXXX\"},\"ImagePublisher\":{\"value\":\"openlogic\"},\"ImageOffer\":{\"value\":\"CentOS\"},\"ImageSku\":{\"value\":\"7.3\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_F2s\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"azure\"},\"MasterVMName\":{\"value\":\"centos73\"},\"NumDataDisks\":{\"value\":\"2\"}}" --debug
az group create -l westeurope -n ubuntupublicwe && az group deployment create -g ubuntupublicwe -n ubuntupublicwe --template-uri https://raw.githubusercontent.com/cloudgear-io/azure-bigcompute/master/azuredeploy.json --parameters "{\"singleOrCluster\":{\"value\":\"single\"},\"DnsLabelPrefix\":{\"value\":\"ubuntupublic\"},\"AdminUserName\":{\"value\":\"azureuser\"},\"SshPublicKey\":{\"value\":\"XXXXX\"},\"ImagePublisher\":{\"value\":\"Canonical\"},\"ImageOffer\":{\"value\":\"UbuntuServer\"},\"ImageSku\":{\"value\":\"16.04-LTS\"},\"HeadandWorkerNodeSize\":{\"value\":\"Standard_F2s\"},\"WorkerNodeCount\":{\"value\": 0},\"GpuHpcUserName\":{\"value\":\"azure\"},\"MasterVMName\":{\"value\":\"ubuntu1604\"},\"NumDataDisks\":{\"value\":\"2\"}}" --debug
ssh in as azuregpuuser@DNS, then sudo su - gpuclususer, and then direct ssh.
For an MPI-enabled GROMACS build, add the -DGMX_MPI=on cmake option. cmake itself can be installed via yum/apt-get install -y cmake.
Then,
cd /opt && \
export GROMACS_DOWNLOAD_SUM=e9e3a41bd123b52fbcc6b32d09f8202b && export GROMACS_PKG_VERSION=2016.3 && curl -o gromacs-$GROMACS_PKG_VERSION.tar.gz -fsSL http://ftp.gromacs.org/pub/gromacs/gromacs-$GROMACS_PKG_VERSION.tar.gz && \
echo "$GROMACS_DOWNLOAD_SUM gromacs-$GROMACS_PKG_VERSION.tar.gz" | md5sum -c --strict - && \
tar xfz gromacs-$GROMACS_PKG_VERSION.tar.gz && \
cd gromacs-$GROMACS_PKG_VERSION && \
mkdir build-gromacs && \
cd build-gromacs && \
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0 && \
make && \
make install && \
export PATH=/usr/local/gromacs/bin:$PATH
After the above, the gmx binary will be available. For further reference, please visit the latest GROMACS manual.
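As an optional, minimal sanity check after the build (the topol.tpr input name here is just an illustrative placeholder, not something this repo ships):
gmx --version
gmx mdrun -s topol.tpr -deffnm test_run -nb gpu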
NVIDIA Tesla driver silent install, without a further reboot, is performed via azuredeploy.sh in this repository for cluster or single node as follows:
:grey_exclamation:
Currently, this is not required when using the secure cuda-repo-ubuntu1604_8.0.61-1_amd64.deb for Azure NC VMs running Ubuntu Server 16.04 LTS.
It is, however, required for the NVIDIA driver with DKMS (Dynamic Kernel Module Support), so that the driver module survives kernel updates.
service lightdm stop
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.39/NVIDIA-Linux-x86_64-375.39.run
apt-get install -y linux-image-virtual
apt-get install -y linux-virtual-lts-xenial
apt-get install -y linux-tools-virtual-lts-xenial linux-cloud-tools-virtual-lts-xenial
apt-get install -y linux-tools-virtual linux-cloud-tools-virtual
DEBIAN_FRONTEND=noninteractive apt-mark hold walinuxagent
DEBIAN_FRONTEND=noninteractive apt-get update -y
DEBIAN_FRONTEND=noninteractive apt-get install -y build-essential gcc gcc-multilib dkms g++ make binutils linux-headers-`uname -r` linux-headers-4.4.0-70-generic
chmod +x NVIDIA-Linux-x86_64-375.39.run
./NVIDIA-Linux-x86_64-375.39.run --silent --dkms
DEBIAN_FRONTEND=noninteractive update-initramfs -u
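An optional sanity check after the silent install (assuming the driver loaded without a reboot, as above):
nvidia-smi
dkms status | grep -i nvidia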
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.39/NVIDIA-Linux-x86_64-375.39.run
yum clean all
yum update -y dkms
yum install -y gcc make binutils gcc-c++ kernel-devel kernel-headers --disableexcludes=all
yum -y upgrade kernel kernel-devel
chmod +x NVIDIA-Linux-x86_64-375.39.run
cat >>~/install_nvidiarun.sh <<EOF
cd /var/lib/waagent/custom-script/download/0 && \
./NVIDIA-Linux-x86_64-375.39.run --silent --dkms --install-libglvnd && \
sed -i '$ d' /etc/rc.d/rc.local && \
chmod -x /etc/rc.d/rc.local
rm -rf ~/install_nvidiarun.sh
EOF
chmod +x ~/install_nvidiarun.sh
echo -ne "~/install_nvidiarun.sh" >> /etc/rc.d/rc.local
chmod +x /etc/rc.d/rc.local
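Since this CentOS path defers the actual driver install to the next boot via rc.local, an optional post-reboot check (a sketch, not part of azuredeploy.sh):
dkms status
lsmod | grep nvidia
nvidia-smi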
Silent and secure installation of the NVIDIA CUDA Toolkit is performed via azuredeploy.sh in this repository for cluster or single node.
CUDA_REPO_PKG=cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
DEBIAN_FRONTEND=noninteractive apt-mark hold walinuxagent
export CUDA_DOWNLOAD_SUM=1f4dffe1f79061827c807e0266568731 && export CUDA_PKG_VERSION=8-0 && curl -o cuda-repo.deb -fsSL http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG} && \
echo "$CUDA_DOWNLOAD_SUM cuda-repo.deb" | md5sum -c --strict - && \
dpkg -i cuda-repo.deb && \
rm cuda-repo.deb && \
apt-get update -y && apt-get install -y cuda && \
apt-get install -y nvidia-cuda-toolkit && \
export LIBRARY_PATH=/usr/local/cuda-8.0/lib64/:${LIBRARY_PATH} && export LIBRARY_PATH=/usr/local/cuda-8.0/lib64/stubs:${LIBRARY_PATH} && \
export PATH=/usr/local/cuda-8.0/bin:${PATH}
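The exports above only affect the current shell; a hedged sketch for persisting them for the login user and verifying the toolkit (paths assume the default CUDA 8.0 install location used above):
echo 'export PATH=/usr/local/cuda-8.0/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
nvcc --version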
wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm
rpm -i cuda-repo-rhel7-8.0.61-1.x86_64.rpm
yum clean all
yum install -y cuda
CUDA samples are installed via azuredeploy.sh in this repository for cluster or single node, in the parameterized RAID0 location, as follows for Ubuntu:
export SHARE_DATA="/data/data"
export SAMPLES_USER="gpuuser"
su -c "/usr/local/cuda-8.0/bin/./cuda-install-samples-8.0.sh $SHARE_DATA" $SAMPLES_USER
For CentOS 7.3, the samples are installed in /usr/local/cuda-8.0/samples.
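An optional example of building and running one of the installed samples (deviceQuery), assuming the Ubuntu RAID0 location and the SHARE_DATA export from above (use /usr/local/cuda-8.0/samples on CentOS instead):
cd $SHARE_DATA/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery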
The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
cuDNN is part of the NVIDIA Deep Learning SDK and is installed silently via azuredeploy.sh in this repository for cluster or single node, as follows:
export CUDNN_DOWNLOAD_SUM=a87cb2df2e5e7cc0a05e266734e679ee1a2fadad6f06af82a76ed81a23b102c8 && curl -fsSL http://developer.download.nvidia.com/compute/redist/cudnn/v5.1/cudnn-8.0-linux-x64-v5.1.tgz -O && \
echo "$CUDNN_DOWNLOAD_SUM cudnn-8.0-linux-x64-v5.1.tgz" | sha256sum -c --strict - && \
tar -xzf cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local && \
rm cudnn-8.0-linux-x64-v5.1.tgz && \
ldconfig
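An optional check that the cuDNN 5.1 headers and libraries landed where the tarball extracts them (under /usr/local/cuda, per the tar step above):
grep -A 2 '#define CUDNN_MAJOR' /usr/local/cuda/include/cudnn.h
ls /usr/local/cuda/lib64/libcudnn*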
Parameterized installation of the nvidia-docker binary is automated for both Ubuntu 16.04-LTS and CentOS 7.3.
The latest nvidia 2.0 runtime for Docker is available and running automatically after cluster provisioning.
The latest DIGITS is available @
http://<Cluster_Public_IP>:5000
and TensorBoard @
http://<Cluster_Public_IP>:6006
Besides the latest installation of the NVIDIA CUDA Toolkit during provisioning via this repo, nvidia-docker can be leveraged for a dockerized CUDA Toolkit as per the test and picture below. This opens up possibilities of using the “py” and “gpu” tagged images of CNTK, TensorFlow, Theano and more, available as nightly builds from Docker Hub, with Jupyter notebooks. The latest gitlab.com/nvidia cuDNN RCs can be used for testing.
sudo systemctl start nvidia-docker
nvidia-docker run --rm nvidia/cuda nvidia-smi
More Information available @ https://github.com/NVIDIA/nvidia-docker/wiki
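As one hedged example of the “gpu” tagged Docker Hub images mentioned above (the exact image tag is illustrative and may change upstream), a TensorFlow GPU container with its built-in Jupyter notebook exposed on port 8888:
nvidia-docker run --rm -p 8888:8888 tensorflow/tensorflow:latest-gpu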
By provisioning via this repository, you agree to the terms of the license agreements for NVIDIA software installed silently.
To view the license for the CUDA Toolkit, click here
To view the license for cuDNN click here
Details
ssh in as azurehpcuser@DNS, then sudo su - hpcgpu (the HPC user), and then direct ssh. All paths are set automatically for the provided default users like azureuser/hpc.
For root specifically, su - root is required.
source /opt/intel/impi/5.1.3.223/bin64/mpivars.sh
mpirun -ppn 1 -n 2 -hosts headN,compn0 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 hostname
(Cluster Check)
mpirun -hosts headN,compn0 -ppn <processes per node> -n <number of consecutive processes> -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong
(Base Pingpong stats)
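For example, filling in the placeholders above with one process per node across the head node and the first compute node (hostnames as in the cluster check):
mpirun -hosts headN,compn0 -ppn 1 -n 2 -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong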
ls /sys/class/infiniband
cat /sys/class/infiniband/mlx4_0/ports/1/state
/etc/init.d/opensmd start
(if required)
cat /sys/class/infiniband/mlx4_0/ports/1/rate
All compute nodes automatically run pbs_mom, and the head node runs both pbs_mom and pbs_server, for the latest Torque or PBSPro built from source from their respective master repos at cluster provision time. No post-installation tasks are required after successful cluster deployment, except if np is to be increased from 1.
check for Torque or PBSPro via
pbsnodes -a
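A minimal PBS/Torque job script sketch for pushing the same Intel MPI check through the scheduler (the script name and resource request are illustrative, not from this repo):
cat > hello.pbs <<'EOF'
#PBS -N hello
#PBS -l nodes=2:ppn=1
cd $PBS_O_WORKDIR
source /opt/intel/impi/5.1.3.223/bin64/mpivars.sh
mpirun -n 2 hostname
EOF
qsub hello.pbs
qstat -a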
All paths are set automatically for the provided default users like azureuser/hpc/root.
For root specifically, su - root is required.
To check Slurm info, run: sinfo -N -l
Since this is Intel MPI, the preferred usage is mpirun with sbatch.
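A minimal sbatch sketch along those lines (script name and node counts are illustrative):
cat > hello.slurm <<'EOF'
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
source /opt/intel/impi/5.1.3.223/bin64/mpivars.sh
mpirun -n 2 hostname
EOF
sbatch hello.slurm
squeue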
OMS setup is optional; the OMS Workspace ID and OMS Workspace Key can either be left blank or populated following the steps below.
Create a free account for Microsoft Azure Operations Management Suite (OMS) with a workspace name, then note the Workspace ID, which looks like
ba1e3f33-648d-40a1-9c70-3d8920834669
and the Primary and/or Secondary Key, which looks like xkifyDr2s4L964a/Skq58ItA/M1aMnmumxmgdYliYcC2IPHBPphJgmPQrKsukSXGWtbrgkV2j1nHmU0j8I8vVQ==
Please report bugs by opening an issue in the GitHub Issue Tracker
Patches can be submitted as GitHub pull requests. If using GitHub please make sure your branch applies to the current master as a ‘fast forward’ merge (i.e. without creating a merge commit). Use the git rebase
command to update your branch to the current master if necessary.