CUDA

A parallel computing platform and application programming interface (API) for C, C++, and Fortran that allows software to use graphics processing units (GPUs) for general-purpose processing.


CUDA requires an environment module

To use CUDA, you must first load the appropriate environment module:

module load cuda
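Depending on how the cluster is configured, more than one CUDA version may be installed. The standard environment-modules command below lists the versions available for loading (output varies by site):

$ module avail cuda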

Warning

Due to disk space constraints, NVIDIA CUDA libraries are available only on the login nodes and GPU nodes. They are not available on general-purpose compute nodes. Be sure to specify the Slurm --gres=gpu:[1-4] option when submitting jobs to the cluster.

Compiling with CUDA

Once you have loaded the CUDA module (module load cuda), the nvcc command will be available:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

You can now use the nvcc command to compile CUDA C/C++ code:

$ nvcc -O3 -arch sm_61 -o a.out a.cu

In the above example, the compiler option -arch sm_61 targets compute capability 6.1, which corresponds to the Pascal micro-architecture.
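If you do not yet have CUDA source of your own, the following minimal a.cu is a sketch you can compile with the command above; the kernel, its launch configuration, and the file name are illustrative only:

#include <stdio.h>

__global__ void hello() {
    // Each GPU thread prints its index within the block
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();        // launch one block of 4 threads
    cudaDeviceSynchronize();  // wait for the kernel (and its printf) to finish
    return 0;
}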

Submit CUDA jobs

CUDA jobs are similar to regular HPC jobs, with two additional considerations:

  1. You need to request GPU resources from the scheduler with the --gres=gpu:1 option.
  2. You need to load the CUDA module (module load cuda).

Below is an example job submit script for a CUDA job:

#!/bin/bash

#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --account genacc_q
#SBATCH --gres=gpu:1  # Ensure your job is scheduled on a node with GPU resources 

# Load the CUDA libraries into the environment
module load cuda

# Execute your CUDA code
srun -n 1 ./my_cuda_code < input.dat > output.txt
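Save the script under any name you like (cuda_job.sh below is a placeholder) and submit it with sbatch:

$ sbatch cuda_job.sh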

CUDA Example

The following CUDA code example can help new users get familiar with the GPU resources available in the HPC cluster.

Create a file called deviceQuery.cu:

    #include <stdio.h>
    #include <cuda_runtime.h>
    int main() {
        int dev = 0;                      // query the first GPU device
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device id %d, name %s\n", dev, prop.name);
        printf("number of multi-processors = %d\n", 
            prop.multiProcessorCount);
        printf("Total constant memory: %4.2f kb\n", 
            prop.totalConstMem/1024.0);
        printf("Shared memory per block: %4.2f kb\n",
            prop.sharedMemPerBlock/1024.0);
        printf("Total registers per block: %d\n", 
            prop.regsPerBlock);
        printf("Maximum threads per block: %d\n", 
            prop.maxThreadsPerBlock);
        printf("Maximum threads per multi-processor: %d\n", 
            prop.maxThreadsPerMultiProcessor);
        printf("Maximum number of warps per multi-processor %d\n",
            prop.maxThreadsPerMultiProcessor/32);
        return 0;
    }
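The example above ignores the return status of cudaGetDeviceProperties for brevity. If you want the program to fail with a clear message when no GPU is present, a minimal sketch using the standard CUDA runtime error API might look like this:

    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        // cudaGetErrorString converts the status code to a readable message
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
            cudaGetErrorString(err));
        return 1;
    }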

Compile the code:

$ module load cuda
$ nvcc -o deviceQuery deviceQuery.cu

Create the job submit script (e.g., gpu_test.sh):

#!/bin/bash

#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --account backfill
#SBATCH -t 05:00
#SBATCH --gres=gpu:2 
#SBATCH --mail-type=ALL

# Load CUDA module libraries
module load cuda

# Execute your CUDA code
srun -n 1 ./deviceQuery

Submit the job:

$ sbatch gpu_test.sh
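While the job is queued or running, you can check its status with the standard Slurm squeue command:

$ squeue -u $USER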

Wait for the job to finish. When it does, the output should look something like the following:

device id 0, name GeForce GTX 1080 Ti
number of multi-processors = 28
Total constant memory: 64.00 kb
Shared memory per block: 48.00 kb
Total registers per block: 65536
Maximum threads per block: 1024
Maximum threads per multi-processor: 2048
Maximum number of warps per multi-processor: 64