HPC Benchmarks

GPU Benchmarks

We ran several applications to benchmark a single GPU node containing four NVIDIA GeForce GTX 1080 Ti GPU cards. The results below apply to all of our GPU nodes. Our main objectives were to compare CPU and GPU performance and to see how well performance scales as GPU cards are added. Please treat the results as a rule of thumb: your specific job may or may not see the same performance gain. The exact performance depends on factors such as the type of job, its parallelizability, and its memory requirements.


NAMD is a parallel molecular dynamics application with built-in GPU support. We ran the NAMD apoa1 benchmark on CPU only and on CPU with one to four GPUs to compare performance. The results are shown in Fig. 1. Adding a single GPU reduces the runtime by an order of magnitude, and adding more GPUs shows diminishing returns. With all four GPU cards, the benchmark ran more than 20x faster than the CPU-only run.

NAMD apoa1 benchmark results

Fig. 1 NAMD apoa1 benchmark: The runtime of the benchmark with CPU only (GPU = 0) and 1-4 GPUs (lower is better). The relative speedup compared to CPU-only run is also shown on top of each bar
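The speedup labels in Fig. 1 come from dividing the CPU-only runtime by each GPU runtime. A minimal Python sketch of that arithmetic is given below; it also reports how close each run comes to ideal scaling relative to a single GPU. The function name and the runtimes in the usage lines are placeholders for illustration, not values measured on our nodes.

```python
def speedups(cpu_runtime, gpu_runtimes):
    """Return {gpu_count: (speedup_vs_cpu, efficiency_vs_one_gpu)}.

    cpu_runtime  -- runtime of the CPU-only run, in any time unit
    gpu_runtimes -- dict mapping GPU count (1, 2, ...) to runtime in the same unit
    """
    if 1 not in gpu_runtimes:
        raise ValueError("a single-GPU runtime is needed as the scaling baseline")
    one_gpu = gpu_runtimes[1]
    result = {}
    for count, runtime in sorted(gpu_runtimes.items()):
        speedup = cpu_runtime / runtime
        # 1.0 means perfect linear scaling from one GPU to `count` GPUs.
        efficiency = one_gpu / (runtime * count)
        result[count] = (speedup, efficiency)
    return result


# Placeholder runtimes (seconds), chosen only to exercise the function.
example = speedups(cpu_runtime=1000.0, gpu_runtimes={1: 100.0, 2: 62.5, 4: 50.0})
for count, (speedup, efficiency) in example.items():
    print(f"{count} GPU(s): {speedup:.1f}x vs CPU, {efficiency:.0%} scaling efficiency")
```

An efficiency well below 100% at higher GPU counts is what "diminishing returns" looks like numerically.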


LuxMark is an OpenCL-based image rendering benchmark tool. We used two different scenes and obtained linear scaling as we added GPU cards, as shown in Fig. 2. There is no CPU-only comparison; the number on each bar shows the performance gain relative to a single-GPU run. The scene "Hotel" is more complex than "Luxball" and therefore has a lower score, although both scale linearly across multiple GPUs.

LuxMark benchmark results

Fig. 2 LuxMark benchmark: Y axis shows the benchmark score (the larger the better), and the numbers in the columns show the relative performance compared to a single GPU


Because GPUs are increasingly popular in machine learning research, we also ran some TensorFlow benchmarks. We compiled GPU-enabled TensorFlow version 1.8.0 from source. The benchmarks we chose train convolutional neural networks (CNNs) on large numbers of images. The results are shown in Fig. 3: some benchmarks (inception3 and resnet50) scale nearly linearly, while others (vgg16) do not.

TensorFlow benchmarks

Fig. 3 Machine learning benchmarks: We used three different TensorFlow CNN benchmarks with batch sizes of 32 and 64.

Compute node benchmarks

To measure the computational performance and scaling of the systems available on the HPC, we used the benchmarking tools provided with WRF, the Weather Research and Forecasting model. In our tests, we looked at the GFlops for three different compilers (GNU, Intel, and PGI), each coupled with either Open MPI or MVAPICH2, giving a total of six combinations. Each combination was tested on 4, 16, 64, 144, 256, 400, and 576 processors. The scaling of each combination is shown for all of the available hardware on the HPC, which is grouped by the year in which it was introduced (2012, 2013, 2014, and 2015).

In the Benchmark Results section, we show two views of the same data. First, we show how each compiler combination (e.g., GNU Open MPI, Intel MVAPICH2, PGI Open MPI) performs across hardware upgrades: six plots, each showing the scaling of one compiler combination on every year's hardware. Then we show the inverse: four plots, one per hardware year, each comparing the compiler combinations on that year's hardware.
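The processor counts listed above (4 through 576) are all perfect squares, so each job can be split into a square grid of MPI tasks through the nproc_x and nproc_y entries of namelist.input (see WRF Configuration). The text does not record which grid shapes were actually used. The Python sketch below simply assumes the most nearly square factorization, and process_grid is a hypothetical helper name.

```python
import math


def process_grid(nprocs):
    """Return (nproc_x, nproc_y), the most nearly square factor pair of nprocs."""
    if nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    # Walk down from the integer square root to the first exact divisor.
    for x in range(math.isqrt(nprocs), 0, -1):
        if nprocs % x == 0:
            return x, nprocs // x


PROCESSOR_COUNTS = [4, 16, 64, 144, 256, 400, 576]

for n in PROCESSOR_COUNTS:
    nproc_x, nproc_y = process_grid(n)
    print(f"{n:4d} processors -> nproc_x = {nproc_x}, nproc_y = {nproc_y}")
```

For a count that is not a perfect square, the same function returns the closest rectangular grid, for example 3 by 4 for 12 processors.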

WRF Configuration

WRF provides two example data sets used specifically for benchmarking purposes. We use the lower resolution data set (12km CONUS, Oct. 2001) in our tests. We follow most of the instructions outlined on the benchmarking page. During the configuration stage, we use the dmpar options for each compiler under Linux x86_64, using basic nesting. We then modify the configure.wrf file to change a few compiler flags, where most importantly we remove the flags -f90=$(SFC) and -cc=$(SCC), which ensures that distributed memory parallel processing is used rather than shared memory parallel processing. After configuring and compiling, we submit a SLURM script with some combination of the options mentioned in the Introduction (compiler, number of processors, and hardware year). We also modify the namelist.input file to account for the number of processors in each dimension (the values nproc_x and nproc_y). Following the successful execution of the program, the results are recorded in a publicly available directory:


Specific results can be found in the subdirectories. For example, the results for GNU Open MPI on the 2010 hardware using 4 processors are located in the subdirectory:


Timing information is in the file rsl.error.0000 in each of the subdirectories. To compute the GFlops, we use the stats.awk program provided by WRF, with the command

grep 'Timing for main' rsl.error.0000 | tail -149 | awk '{print $9}' | awk -f stats.awk

However, this command is already included in the Python script used to calculate the GFlops for all configurations. A MATLAB script is then used to plot the results.
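For readers who want to see what the pipeline extracts, here is a rough Python equivalent. It is a sketch, not the contents of scanResults.py or stats.awk: it keeps the 'Timing for main' lines, takes the last 149, reads the ninth whitespace-separated field of each, and computes only the mean. The sample lines assume the usual rsl.error.0000 layout and are synthetic. The operation count per time step (gflop_per_step) is left as a parameter because its value must come from the WRF benchmark documentation; the 6.0 used below is a placeholder.

```python
import statistics


def step_times(lines, last_n=149):
    """Extract per-step elapsed seconds from WRF 'Timing for main' lines.

    Mirrors the shell pipeline: keep matching lines, take the last `last_n`,
    and read the 9th whitespace-separated field of each.
    """
    matching = [line for line in lines if "Timing for main" in line]
    return [float(line.split()[8]) for line in matching[-last_n:]]


def summarize(times, gflop_per_step):
    """Return (mean seconds per step, GFlops) for a list of step times.

    gflop_per_step is the operation count of one time step, in billions of
    floating-point operations; it is deliberately not hard-coded here.
    """
    mean = statistics.mean(times)
    return mean, gflop_per_step / mean


# Synthetic lines in the assumed rsl.error.0000 layout, for illustration only.
sample = [
    "Timing for main: time 2001-10-25_00:01:12 on domain   1:    2.00000 elapsed seconds.",
    "Timing for main: time 2001-10-25_00:02:24 on domain   1:    4.00000 elapsed seconds.",
    "some unrelated log line",
]
mean_time, gflops = summarize(step_times(sample), gflop_per_step=6.0)
print(f"mean step time: {mean_time:.3f} s, {gflops:.2f} GFlops")
```

To run it on a real job, pass the lines of the file instead of the sample, for example step_times(open("rsl.error.0000")).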

WRF Availability for Users

All of the post-processing tools described above are available to RCC users in the directory


or available via cloning our git repository, described at the end of this section. Additionally, there should be no need to reconfigure/recompile WRF when benchmarking on the HPC for the six compiler combinations described above, as these are already available in the directory


To use these tools, some slight modifications will be required in order to properly place the output from the WRF benchmark tests and post-processing into a user's home directory, which will be outlined after explaining each tool.

The submit script submitJob.sh creates a new folder, adds all the necessary symbolic links, then creates and submits a MOAB script for the job. It takes three required command-line arguments (the compiler combination, the year of the machines being tested, and the number of processors requested) and one optional argument, the estimated time for the job to complete. The time defaults to 02:00:00, which is the amount required for a 4-processor job, and can be reduced if more processors are used. It may be useful to run head submitJob.sh, since all of these parameters are briefly explained in the first few lines of the script. After the job completes, the output folder will contain two large files that are identical regardless of the job configuration and do not need to be kept: wrfout_d01_2001-10-25_00_00_00 and wrfout_d01_2001-10-25_03_00_00. They should be deleted to save space.
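If many jobs have been run, removing the two wrfout files by hand becomes tedious. The Python sketch below deletes them from every job folder under a given output directory. The function name is hypothetical, and the sketch assumes the job folders sit somewhere beneath the directory you pass in.

```python
from pathlib import Path

# File names taken from the text above; identical for every job configuration.
LARGE_OUTPUTS = (
    "wrfout_d01_2001-10-25_00_00_00",
    "wrfout_d01_2001-10-25_03_00_00",
)


def remove_large_outputs(output_root):
    """Delete the two wrfout files from every job folder under output_root.

    Returns the list of paths that were removed.
    """
    removed = []
    for name in LARGE_OUTPUTS:
        # Collect matches first so nothing is deleted while the tree is being walked.
        for path in list(Path(output_root).rglob(name)):
            if path.is_file():
                path.unlink()
                removed.append(path)
    return removed
```

A typical call would be remove_large_outputs(Path.home() / "WRF" / "output"), using the folder layout suggested later in this section. All other files, including rsl.error.0000, are left in place.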

The two other scripts were used primarily to generate the figures shown below, though some users may find them useful. The scanResults.py script crawls through all of the simulation results and finds the timing information, making use of the calcGF.sh and stats.awk files, which aren't meant to be directly used by the user. This script outputs a file that contains the compiler combination, the year, the number of processors, the average time per simulation time step, and the speed in GFlops for each job configuration. The generateFigures.m MATLAB script then uses this timing information to plot the data.

Here we show how to modify each of these scripts to output data to a user's home directory. Note that each of these scripts must be copied to the user's home directory and may need to be given execute permissions. Though the following folder configuration may not be ideal for all users, it helps explain which folder paths need to be changed in each script.

  • Create directory: $HOME/WRF
  • Create subdirectories: $HOME/WRF/figs $HOME/WRF/tools $HOME/WRF/output
  • Copy the tools from the public benchmark directory to your private directory: cp /panfs/storage.local/src/benchmarks/tools/* $HOME/WRF/tools/
  • Edit submitJob.sh: Change the userOutDir variable to point to the output subdirectory in the WRF directory above ($HOME/WRF/output). The queue may also need to be changed; the default is backfill.
  • Edit calcGF.sh: Change the publishedResults variable to point to the output subdirectory.
  • Edit scanResults.py: The compilers, years, and processors arrays may need to be changed to reflect your suite of test configurations. No directory variables need to be modified for this file.
  • Edit generateFigures.m: Change the figureFolder variable to point to the figs subdirectory. The compilerNames, titleNames, fileNames, and years arrays may also need to be changed.
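The first three steps of the list above (creating the folders and copying the tools) can be sketched in Python as follows. The function name is hypothetical, and the variable edits inside each script still have to be made by hand.

```python
import shutil
from pathlib import Path


def set_up_wrf_workspace(home, tools_src):
    """Create WRF/{figs,tools,output} under `home` and copy the tools in.

    Returns the path of the WRF directory.
    """
    wrf_dir = Path(home) / "WRF"
    for sub in ("figs", "tools", "output"):
        (wrf_dir / sub).mkdir(parents=True, exist_ok=True)

    for item in Path(tools_src).iterdir():
        # Like `cp dir/*`: regular, non-hidden files only, no subdirectories.
        if item.is_file() and not item.name.startswith("."):
            # copy2 preserves permission bits, so scripts stay executable.
            shutil.copy2(item, wrf_dir / "tools" / item.name)
    return wrf_dir
```

On the HPC this would be called as set_up_wrf_workspace(Path.home(), "/panfs/storage.local/src/benchmarks/tools"). Running it a second time is harmless: existing folders are kept and the tool files are overwritten with fresh copies.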

To test any changes, a recommended configuration for the submit script is GNU OpenMPI on the 2010 machines with 4 or 16 processors. These jobs should start and complete relatively quickly, even if the backfill or general access queue is chosen.

Finally, these tools can be found on the RCC BitBucket page,


where the specific configuration and compilation steps to set up WRF are available if the need arises, as well as some additional information on how to set up and run the tools described above. This repository can be cloned using the command

git clone https://bitbucket.org/fsurcc/wrf-benchmarks

but the same changes to the scripts described above will need to be made.

Benchmark Results By Year

The following plot shows the average run times over all compilers for each year. It shows that the 2014 and 2015 nodes perform better than the 2012 and 2013 nodes.

Benchmark Results By Compiler

For each hardware configuration (year), the top graph shows the run time for each compiler, and the bottom graph plots the ratio of each run time to the run time of WRF built with gnu-openmpi. The purpose of this second plot is to make the compilers easier to compare. It clearly shows that the GNU builds (openmpi and mvapich2) are 1.5-2 times slower than the Intel and PGI builds.
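The ratio plotted in the bottom graphs is a simple normalization against the gnu-openmpi run times. A minimal Python sketch of that calculation is shown here; the function name, the "intel-openmpi" label, and the run times are placeholders for illustration, not measured results.

```python
def ratios_to_baseline(run_times, baseline="gnu-openmpi"):
    """Divide every compiler's run time by the baseline compiler's run time.

    run_times -- dict mapping a compiler label to {processor_count: run_time}
    Returns a dict of the same shape holding the ratios; values below 1.0
    mean that compiler was faster than the baseline at that processor count.
    """
    base = run_times[baseline]
    return {
        compiler: {n: t / base[n] for n, t in times.items() if n in base}
        for compiler, times in run_times.items()
    }


# Placeholder run times, chosen only to exercise the function.
example = ratios_to_baseline({
    "gnu-openmpi": {4: 8.0, 16: 2.0},
    "intel-openmpi": {4: 4.0, 16: 1.0},
})
print(example["intel-openmpi"])
```

By construction the baseline itself maps to 1.0 at every processor count, so the other curves can be read directly as "times faster or slower than gnu-openmpi".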

2012 Nodes (AMD)

2013 Nodes (Intel)

2014 Nodes (Intel)

2015 Nodes (Intel)