Using GPUs on the HPC Cluster

The HPC Cluster includes a specialty node, hpc-15-36, which contains two NVIDIA Tesla M2050 GPUs available for GPU parallel programming with CUDA. The node runs CUDA 7.0 and includes the CUDA Toolkit and Computing SDK.

Many software packages are already optimized to take advantage of GPU accelerators, including MATLAB and Gaussian. A more comprehensive list of supported software is available from the NVIDIA website. Note: this list is provided by the GPU vendor, NVIDIA. Not all of the packages listed are available at the RCC. If you see one you are interested in that is not currently available, please let us know: support@rcc.fsu.edu.

Submitting GPU Jobs

To submit a job to the GPU node, submit it to the general-access gpu_q partition. An example submission script is below:

#!/bin/bash
#SBATCH -J "my_gpu_job"
#SBATCH -n 1
#SBATCH -p gpu_q
#SBATCH -t 00:30:00

srun -n 1 ./a.out

where a.out is your executable.
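
Save the script to a file and submit it with sbatch (the file name submit_gpu.sh here is just an example):

$ sbatch submit_gpu.sh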

Compiling GPU software using CUDA

You can compile your CUDA software on the HPC login nodes and then submit jobs to the gpu_q to run on the GPU node. To compile your CUDA software, log in to the HPC and load the CUDA module with the command:

$ module load cuda

This gives you access to all of the available tools for compiling and profiling CUDA programs, as well as to the uncompiled example programs found in the CUDA Computing SDK.
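
To verify that the module loaded correctly, you can check the compiler version:

$ nvcc --version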

The nvcc compiler and associated profiling tools are located in the directory:

/usr/local/cuda-7.0/bin

and the example programs are located in:

/usr/local/cuda-7.0/samples

Compiling Sample Code

To compile a sample code package, copy the desired sample directory to your home directory. For example:

$ cp -r /usr/local/cuda-7.0/samples/1_Utilities/deviceQuery ~

When compiling, use the Makefiles included with the samples. The top-level Makefile, which builds all of the samples, is located at:

/usr/local/cuda-7.0/samples/Makefile
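
Each sample directory also ships with its own Makefile, so after copying a sample you can typically build it in place. For example, to build the deviceQuery sample copied above (assuming the copy keeps its per-sample Makefile, as the CUDA samples typically do):

$ cd ~/deviceQuery
$ make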

Some Examples

The following simple examples will help you get started with CUDA C programming.

Example 1. Querying the GPU.

The following simple example devprop.cu retrieves the number of GPUs on the node and queries the properties of the first device using the functions cudaGetDeviceCount() and cudaGetDeviceProperties():

#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv){

    printf("%s running...\n", argv[0]);

    int devCount;
    cudaError_t error = cudaGetDeviceCount(&devCount);  // count the GPUs on this node
    if (error != cudaSuccess) {
        printf("cudaError: %s\n", cudaGetErrorString(error));
        return 1;
    }
    printf("number of devices: %d\n", devCount);

    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, 0);               // query device 0
    printf("maxThreadsPerBlock = %d\n", devProp.maxThreadsPerBlock);
    printf("max block dimension (%d, %d, %d)\n", devProp.maxThreadsDim[0],
           devProp.maxThreadsDim[1], devProp.maxThreadsDim[2]);
    printf("max grid dimension (%d, %d, %d)\n", devProp.maxGridSize[0],
           devProp.maxGridSize[1], devProp.maxGridSize[2]);
    return 0;
}
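
Compile the code with nvcc and run the executable on the GPU node (for example, by substituting it for a.out in the submit script shown earlier):

$ nvcc -o devprop devprop.cu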

The output should be something like the following:

./devprop running...
number of devices: 2
maxThreadsPerBlock = 1024
max block dimension (1024, 1024, 64)
max grid dimension (65535, 65535, 65535)

Example 2. Print Block and Thread Indices

The following example checkDim.cu launches a CUDA kernel on a small 1-D grid of 4 blocks, each a 1-D block of 4 threads. The kernel function checkIdx() simply prints the block and thread indices of each thread.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void checkIdx() {

     int tx = threadIdx.x;
     int ty = threadIdx.y;
     int tz = threadIdx.z;
     int bx = blockIdx.x;
     int by = blockIdx.y;
     int bz = blockIdx.z;

     printf("threadIdx (%d,%d,%d), gridIdx (%d,%d,%d)\n",
             tx,ty,tz,bx,by,bz);
}

int main(){

    int  nElem = 15;
    dim3 dimBlock(4,1,1);
    dim3 dimGrid( (nElem + dimBlock.x - 1)/dimBlock.x,  1, 1);  // ceil(15/4) = 4 blocks

    printf("blockdim = (%d, %d, %d)\n", dimBlock.x, dimBlock.y, dimBlock.z);
    printf("griddim  = (%d, %d, %d)\n", dimGrid.x,  dimGrid.y,  dimGrid.z);

    checkIdx<<<dimBlock, dimGrid>>>();
    cudaDeviceReset();
    return 0;
}

The result should be something similar to the following (the order in which blocks execute is not deterministic):

blockdim = (4, 1, 1)
griddim  = (4, 1, 1)
threadIdx (0,0,0), blockIdx (0,0,0)
threadIdx (1,0,0), blockIdx (0,0,0)
threadIdx (2,0,0), blockIdx (0,0,0)
threadIdx (3,0,0), blockIdx (0,0,0)
threadIdx (0,0,0), blockIdx (3,0,0)
threadIdx (1,0,0), blockIdx (3,0,0)
threadIdx (2,0,0), blockIdx (3,0,0)
threadIdx (3,0,0), blockIdx (3,0,0)
threadIdx (0,0,0), blockIdx (1,0,0)
threadIdx (1,0,0), blockIdx (1,0,0)
threadIdx (2,0,0), blockIdx (1,0,0)
threadIdx (3,0,0), blockIdx (1,0,0)
threadIdx (0,0,0), blockIdx (2,0,0)
threadIdx (1,0,0), blockIdx (2,0,0)
threadIdx (2,0,0), blockIdx (2,0,0)
threadIdx (3,0,0), blockIdx (2,0,0)
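
Each line above identifies a thread by its (blockIdx, threadIdx) pair. To convert such a pair into a unique global index in a 1-D grid, the standard idiom (used again in Example 5 below) is:

int id = threadIdx.x + blockIdx.x * blockDim.x;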

Example 3. Intra-Block Thread Synchronization

Threads within a block can be synchronized using the routine

__syncthreads();

The following code localVariable.cu shows the effect of thread synchronization within a block.

#include <stdio.h>
#include <cuda_runtime.h>
__global__ void kernel() {
    double  a = 2.71828;    // register variable, automatic
    double  c[100];         // local array, automatic (unused; for illustration)
    __shared__ double b;    // shared variable, visible to all threads in the block
    int  tx  = threadIdx.x; // register variable
    if (tx == 0) {
        b = 3.1415926f;     // only thread 0 initializes the shared variable
    }
    //__syncthreads();        // run with/without this line
    printf("id = %d, a=%7.5f, b=%9.7f\n", tx, a, b);
}

int main() {
    kernel<<<1,8>>>();
    cudaDeviceReset();
    return 0;
}

Note that in the kernel function kernel(), the __syncthreads() line is commented out for now. Compile this code using

nvcc  -o localVariable.out  localVariable.cu

and run it; you will see output similar to the following:

id = 0, a=2.71828, b=3.1415925
id = 1, a=2.71828, b=-1187.8298054
id = 2, a=2.71828, b=-1187.8298054
id = 3, a=2.71828, b=-1187.8298054
id = 4, a=2.71828, b=-1187.8298054
id = 5, a=2.71828, b=-1187.8298054
id = 6, a=2.71828, b=-1187.8298054
id = 7, a=2.71828, b=-1187.8298054

This result is wrong: the shared variable b is initialized by thread 0, but without proper synchronization the other threads may read b before thread 0 has written it, so they print a garbage value. Now uncomment the __syncthreads() line, recompile, and run it again; you will get the correct result:

id = 0, a=2.71828, b=3.1415925
id = 1, a=2.71828, b=3.1415925
id = 2, a=2.71828, b=3.1415925
id = 3, a=2.71828, b=3.1415925
id = 4, a=2.71828, b=3.1415925
id = 5, a=2.71828, b=3.1415925
id = 6, a=2.71828, b=3.1415925
id = 7, a=2.71828, b=3.1415925

Example 4. Timing A CUDA Kernel.

The following simple code time.cu contains a host routine cpuSecond(), which returns the current system time accurate to a microsecond. This routine can be used to time a CUDA kernel:

#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <sys/time.h>

double cpuSecond() {
    // wall-clock time in seconds, with microsecond resolution
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double) tp.tv_sec + (double) tp.tv_usec*1e-6;
}

__global__ void wastime() {

    // busy work: every thread computes and prints 10000 sine values
    int    i;
    double t;
    int    id = threadIdx.x;
    for (i = 0; i < 10000; i++){
        t = sin(i*3.14/2.718);
        printf("id = %d, i = %d, t = %10.6f\n", id, i, t);
    }
    return;
}

int main(int argc, char** argv){

    double      tStart, tStop;
    cudaError_t error;
    printf("current time %f\n", cpuSecond());
    tStart = cpuSecond();
    wastime<<<1, 8>>>();
    error = cudaPeekAtLastError();
    if (error != cudaSuccess) {
        printf("cudaError: %s\n", cudaGetErrorString(error));
    }
    cudaDeviceSynchronize();
    tStop = cpuSecond() - tStart;
    printf("wasted time %10.6f seconds\n", tStop);
    return 0;
}

You will see results similar to the following:

current time 1458671293.889923
id = 0, i = 9488, t =  -0.101526
id = 1, i = 9488, t =  -0.101526
id = 2, i = 9488, t =  -0.101526
id = 3, i = 9488, t =  -0.101526
id = 4, i = 9488, t =  -0.101526
id = 5, i = 9488, t =  -0.101526
id = 6, i = 9488, t =  -0.101526
id = 7, i = 9488, t =  -0.101526
.... (many lines skipped here)
id = 0, i = 9999, t =   0.178276
id = 1, i = 9999, t =   0.178276  
id = 2, i = 9999, t =   0.178276  
id = 3, i = 9999, t =   0.178276
id = 4, i = 9999, t =   0.178276
id = 5, i = 9999, t =   0.178276
id = 6, i = 9999, t =   0.178276
id = 7, i = 9999, t =   0.178276
wasted time   1.188478 seconds 
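
Host-side timers like cpuSecond() measure wall-clock time, which is why the cudaDeviceSynchronize() call is needed before reading the stop time. An alternative is the CUDA event API, which records timestamps on the GPU itself; a minimal sketch of timing the same kernel with events:

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);               // mark the start on the default stream
wastime<<<1, 8>>>();
cudaEventRecord(stop, 0);                // mark the end
cudaEventSynchronize(stop);              // wait for the kernel to finish
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time %10.6f seconds\n", ms/1000.0f);
cudaEventDestroy(start);
cudaEventDestroy(stop);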

Example 5. Moving Data between Host and Device.

The following code sum1d.cu first allocates memory for two 1-D arrays in host memory using malloc(), then initializes them using the host function init_data(). It next allocates device memory using cudaMalloc() and copies the two host arrays to device memory using cudaMemcpy(). It then sums the two arrays on the GPU using the kernel function sum_1D() and copies the result back to the host, again using cudaMemcpy(). Finally, it computes the mean value of the summed array on the host.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void init_data(float* ip, int size) {
    int    i;
    time_t t;
    srand( (unsigned int) time(&t) );            // seed the host RNG
    for (i = 0; i < size; i++) {
        ip[i] = (float) (rand() & 0xFF)/10.0f;   // random value in [0.0, 25.5]
    }
    return;
}

__global__ void sum_1D(float* A_d, float* B_d, float* C_d,
                     int size) {

   int tx = threadIdx.x;
   int bx = blockIdx.x;
   int id = tx + bx*blockDim.x;   // global thread index
   if (id < size) {               // guard against threads beyond the array bounds
      C_d[id] = A_d[id] + B_d[id];
   }
   return;
}

float mean_1D(float* C, int size) {
    int i;
    float tot = 0;
    for (i = 0; i < size; i++) {
       tot += C[i];
    }
    return tot/size;
}

int main() {

    int    nElem  = 1024;
    size_t nbytes = nElem*sizeof(float);
    float *A_h, *B_h, *C_h;
    float *A_d, *B_d, *C_d;
    int   i;

    A_h  = (float*) malloc(nbytes);
    B_h  = (float*) malloc(nbytes);
    C_h  = (float*) malloc(nbytes);
    init_data(A_h, nElem);
    init_data(B_h, nElem);

    cudaMalloc( (void **) &A_d, nbytes);
    cudaMalloc( (void **) &B_d, nbytes);
    cudaMalloc( (void **) &C_d, nbytes);

    cudaMemcpy(A_d, A_h, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, nbytes, cudaMemcpyHostToDevice);

    sum_1D<<<2, 512>>>(A_d, B_d, C_d, nElem);
    cudaMemcpy(C_h, C_d, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);

    for (i = 0; i < nElem; i++)
    {
        printf("C[%d] = %10.4f\n", i, C_h[i]);
    }
    printf("mean value of C is %10.4f\n", mean_1D(C_h, nElem));
    free(A_h);
    free(B_h);
    free(C_h);
    return 0;
}

The results will look similar to the following (the values vary from run to run because the inputs are random):

C[0] =    17.6000
C[1] =     7.8000 
C[2] =    42.2000
C[3] =    20.8000
C[4] =    11.8000
C[5] =    18.6000
C[6] =    36.0000
C[7] =    48.8000
....
C[1016] =    47.2000
C[1017] =    44.8000
C[1018] =    24.0000
C[1019] =    13.0000
C[1020] =    34.8000
C[1021] =    20.6000
C[1022] =    19.4000
C[1023] =    34.6000
mean value of C is    24.9256
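
For brevity, the CUDA calls in sum1d.cu are not checked for errors. In production code it is good practice to check every call; a common idiom (a sketch, not part of the CUDA Toolkit) is a small wrapper macro:

#define CHECK(call)                                               \
{                                                                 \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
        printf("CUDA error %s:%d: %s\n", __FILE__, __LINE__,      \
               cudaGetErrorString(err));                          \
        exit(1);                                                  \
    }                                                             \
}

/* usage: CHECK(cudaMalloc((void **) &A_d, nbytes)); */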

Learning More about GPU Programming

If you are new to GPU programming, NVIDIA provides a number of resources for getting started with CUDA on their website.

For more advanced users, refer to the CUDA Best Practices Guide.