CUDA

CUDA is a parallel computing platform and programming model developed by NVIDIA for its GPUs.

Introduction

Scientific simulations can often be accelerated significantly by hardware accelerators such as Graphics Processing Units (GPUs). GPUs are available on several HPC nodes. The GPUs currently available are NVIDIA GeForce GTX 1080 Ti cards, which are based on the Pascal micro-architecture and have compute capability 6.1. The CUDA driver version is 9.2. The following table shows the key parameters of the GPUs at the RCC:

Parameter                            Value
Brand Name                           GTX 1080 Ti
Compute Capability                   6.1
Micro-Architecture                   Pascal
Number of Streaming Multiprocessors  28
Number of CUDA Cores                 3584
Boost Clock                          1600 MHz
Memory Capacity                      11 GB
Memory Bandwidth                     ~484 GB/s
FP32 Performance                     ~11.4 TFLOPS
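
As a rough sanity check, the FP32 figure in the table can be reproduced from the core count and the boost clock. The sketch below is host-only code that hard-codes the table values; it assumes that each CUDA core completes one fused multiply-add (counted as 2 floating-point operations) per clock cycle, and the variable names are illustrative only.

```cuda
#include <stdio.h>

int main(void) {
    /* Values copied from the table above. */
    const double cuda_cores     = 3584.0;
    const double boost_clock_hz = 1600.0e6;   /* 1600 MHz */

    /* Assumption: one fused multiply-add per core per cycle = 2 FLOPs. */
    const double flops_per_cycle = 2.0;

    double peak_tflops = cuda_cores * boost_clock_hz * flops_per_cycle / 1.0e12;
    printf("Estimated peak FP32 throughput: %.2f TFLOPS\n", peak_tflops);
    return 0;
}
```

With the table values this evaluates to about 11.47 TFLOPS, in line with the ~11.4 TFLOPS listed. It is a theoretical upper bound; real applications usually reach only a fraction of it.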

Compile CUDA Code

To compile CUDA/C/C++ code, first load the cuda module:

$ module load cuda

The CUDA compiler nvcc should be immediately available,

$ which nvcc
/usr/local/cuda/bin/nvcc

and you can check the CUDA version via

$ nvcc -V
Copyright (c) 2005-2017 NVIDIA Corporation
Cuda compilation tools, release 9.0, V9.0.176

You can then compile your CUDA/C/C++ code with the nvcc compiler:

$ nvcc -O3 -arch sm_61 -o a.out a.cu

In the above, the compiler option "-arch sm_61" specifies compute capability 6.1, which corresponds to the Pascal micro-architecture of the GTX 1080 Ti.
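
For readers who do not yet have a source file to try, the following is a minimal sketch of what a.cu could contain: a vector addition with basic error checking. The kernel name vecAdd, the helper check, and the problem size are illustrative choices, not part of any cluster setup.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread adds one pair of elements. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

/* Abort with a readable message if a CUDA runtime call fails. */
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    const int n = 1 << 20;                    /* illustrative problem size */
    const size_t bytes = n * sizeof(float);

    /* Host buffers. */
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    if (!h_a || !h_b || !h_c) {
        fprintf(stderr, "host malloc failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Device buffers and host-to-device copies. */
    float *d_a, *d_b, *d_c;
    check(cudaMalloc((void **)&d_a, bytes), "cudaMalloc d_a");
    check(cudaMalloc((void **)&d_b, bytes), "cudaMalloc d_b");
    check(cudaMalloc((void **)&d_c, bytes), "cudaMalloc d_c");
    check(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice), "copy b");

    /* Launch enough blocks of 256 threads to cover all n elements. */
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    check(cudaGetLastError(), "kernel launch");

    /* cudaMemcpy waits for the kernel to finish before copying back. */
    check(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost), "copy c");

    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != 3.0f) ++errors;
    }
    printf("vecAdd on %d elements: %d mismatches\n", n, errors);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return errors == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Saved as a.cu, this should build with the nvcc command shown above. It needs to run on a node that has a GPU; on a node without one, the first cudaMalloc call is expected to fail and the program exits with an error message.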

Submit a CUDA Job

To submit a GPU job to the HPC cluster, first create a SLURM submit script sub.sh similar to the following:

#!/bin/bash

#SBATCH -N 1
#SBATCH -n 1
#SBATCH -J "cuda-job"
#SBATCH -t 4:00:00
#SBATCH -p backfill
#SBATCH --gres=gpu:1
#SBATCH --mail-type=ALL

# load the cuda module to set up the environment
module load cuda

# the following line should provide the full path to the cuda compiler
which nvcc

# execute your cuda executable a.out
srun -n 1 ./a.out <input.dat >output.txt

Not all compute nodes have GPU cards, and a GPU node contains up to 4 GPU cards. To request a compute node with GPUs, your submit script must include a line of the following form (the example above requests one GPU):

#SBATCH --gres=gpu:[1-4]    # <-- Choose between 1 and 4 GPU cards to use.

Then submit the job via

$ sbatch sub.sh
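
When a job requests more than one GPU, the program itself has to choose which device each piece of work runs on. The following is a minimal sketch that lists the GPUs visible to the job and makes each one current in turn. It assumes the allocated cards appear to the process as devices 0 to count-1, which depends on how SLURM is configured on the cluster.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("visible GPUs: %d\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                    dev, cudaGetErrorString(err));
            return 1;
        }

        /* Make this device current; subsequent allocations and kernel
           launches from this host thread are directed to it. */
        err = cudaSetDevice(dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaSetDevice(%d) failed: %s\n",
                    dev, cudaGetErrorString(err));
            return 1;
        }

        /* Report free and total memory on the current device. */
        size_t free_bytes = 0, total_bytes = 0;
        err = cudaMemGetInfo(&free_bytes, &total_bytes);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMemGetInfo(%d) failed: %s\n",
                    dev, cudaGetErrorString(err));
            return 1;
        }
        printf("device %d: %s, %.2f GB free of %.2f GB\n", dev, prop.name,
               free_bytes / 1073741824.0, total_bytes / 1073741824.0);
    }
    return 0;
}
```

Running this inside a job is a quick way to confirm that the number of devices the program sees matches the number requested with --gres.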

CUDA Sample Code

The following CUDA code example, deviceQuery.cu, can help new users become familiar with the GPUs available on the HPC cluster:

#include <stdio.h>
#include <cuda_runtime.h>

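int main(void) {

    int dev = 0;
    cudaDeviceProp prop;
    // stop early with a readable message if the device cannot be queried
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
            cudaGetErrorString(err));
        return 1;
    }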
    printf("device id %d, name %s\n", dev, prop.name);
    printf("number of multi-processors = %d\n", 
        prop.multiProcessorCount);
    printf("Total constant memory: %4.2f kb\n", 
        prop.totalConstMem/1024.0);
    printf("Shared memory per block: %4.2f kb\n",
        prop.sharedMemPerBlock/1024.0);
    printf("Total registers per block: %d\n", 
        prop.regsPerBlock);
    printf("Maximum threads per block: %d\n", 
        prop.maxThreadsPerBlock);
    printf("Maximum threads per multi-processor: %d\n", 
        prop.maxThreadsPerMultiProcessor);
    printf("Maximum number of warps per multi-processor %d\n",
        prop.maxThreadsPerMultiProcessor/prop.warpSize);

    return 0;
}

Compile the code via

$ module load cuda
$ nvcc -o deviceQuery deviceQuery.cu

Running the resulting executable on a GPU node produces output similar to the following:

device id 0, name GeForce GTX 1080 Ti
number of multi-processors = 28
Total constant memory: 64.00 kb
Shared memory per block: 48.00 kb
Total registers per block: 65536
Maximum threads per block: 1024
Maximum threads per multi-processor: 2048
Maximum number of warps per multi-processor 64