HPC Benchmarks

GPU Benchmarks

We ran three different applications to benchmark a single GPU node containing four NVIDIA GeForce GTX 1080 Ti GPU cards; the following results are therefore representative of all of our GPU nodes. Our main objectives were to compare CPU vs. GPU performance and to see how well performance scaled as GPU cards were added to the job runtime configuration. Please note that your specific job may or may not see the same performance gain; treat the presented results as a rule of thumb. The exact performance depends on various factors such as the type of job, its parallelizability, memory requirements, etc.

NAMD

NAMD is a parallel molecular dynamics application with built-in GPU support. We ran the NAMD apoa1 benchmark on CPU only and on CPU with multiple GPUs to compare performance. Results are shown in Fig. 1. The figure shows that adding a single GPU reduces the runtime by an order of magnitude and that adding further GPUs yields diminishing returns. With all four GPU cards, we obtained more than a 20x speedup over the CPU-only run.

Fig. 1 NAMD apoa1 benchmark: The runtime of the benchmark with CPU only (GPU = 0) and 1-4 GPUs (lower is better). The relative speedup compared to the CPU-only run is also shown on top of each bar.
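For reference, a comparison of this kind might be run as sketched below with a CUDA-enabled NAMD build; the binary name, thread count, and input path are placeholders rather than our exact job configuration.

# CPU-only run of the apoa1 benchmark (placeholder core count and input path)
namd2 +p16 apoa1/apoa1.namd
# Same run with one GPU, then with all four GPUs on the node
namd2 +p16 +devices 0 apoa1/apoa1.namd
namd2 +p16 +devices 0,1,2,3 apoa1/apoa1.namd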

LuxMark

LuxMark is an OpenCL-based image rendering benchmark tool. We used two different scenes and observed linear scaling as we added GPU cards, as shown in Fig. 2. There is no CPU-only comparison, so the number on each bar shows the performance gain relative to the single-GPU run. The "Hotel" scene is more complex than "Luxball" and hence has a lower score, but both scale linearly across multiple GPUs.

Fig. 2 LuxMark benchmark: The y-axis shows the benchmark score (higher is better), and the numbers on the columns show the relative performance compared to a single GPU.

TensorFlow

Because GPUs are becoming increasingly popular in machine learning research, we also ran some TensorFlow benchmarks. We compiled GPU-enabled TensorFlow 1.8.0 from source. The benchmarks we chose train convolutional neural networks (CNNs) on large numbers of images. The results are shown in Fig. 3: some benchmarks (inception3 and resnet50) scale in a near-linear manner, while others (vgg16) do not.

Fig. 3 Machine learning benchmarks: We used three different TensorFlow CNN benchmarks with batch sizes of 32 and 64.
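For illustration, a run of this kind might look like the sketch below, assuming the standard tf_cnn_benchmarks suite; the script path and flag values are examples rather than our exact configuration.

# Train resnet50 with batch size 64 on all four GPUs (values are examples)
python tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --num_gpus=4
# Train vgg16 with batch size 32 on a single GPU
python tf_cnn_benchmarks.py --model=vgg16 --batch_size=32 --num_gpus=1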

Compute node benchmarks

To determine the computational performance and scaling of the systems available on the HPC, we used the benchmarking tools provided with WRF, the Weather Research and Forecasting model. In our tests, we measured the GFlops for three different compilers (GNU, Intel, and PGI), each coupled with either OpenMPI or MVAPICH2, giving a total of six combinations. Each combination was tested across 4, 16, 64, 144, 256, 400, and 576 processors. The scaling of each combination is shown for all of the available hardware on the HPC, categorized according to the year in which it was introduced (2012 through 2019).

We show two views of the same data in the benchmark results sections below. First, we show how each compiler combination (e.g., GNU-OpenMPI, Intel-MVAPICH2, PGI-OpenMPI) performs across hardware upgrades, giving six plots; each plot details the scaling across every year's hardware for one compiler combination. Then we show the inverse: four plots, one per hardware year, each showing the performance of every compiler combination on that year's hardware.

WRF Configuration

WRF provides two example data sets intended specifically for benchmarking. We use the lower-resolution data set (12 km CONUS, Oct. 2001) in our tests and follow most of the instructions outlined on the WRF benchmarking page. During the configuration stage, we use the dmpar option for each compiler under Linux x86_64, with basic nesting. We then modify the configure.wrf file to change a few compiler flags; most importantly, we remove the flags -f90=$(SFC) and -cc=$(SCC), which ensures that distributed-memory parallel processing is used rather than shared-memory parallel processing. After configuring and compiling, we submit a SLURM script with some combination of the options mentioned in the Introduction (compiler, number of processors, and hardware year); a sketch of such a script follows the results path below. We also modify the namelist.input file to set the number of processors in each dimension (the values nproc_x and nproc_y). Following successful execution of the program, the results are recorded in a publicly available directory:

/gpfs/research/software/benchmarks/benchmarks.old/publish/results
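As mentioned above, a minimal sketch of such a submission script might look like the following; the module names, constraint value, and process count are placeholders rather than our exact configuration.

#!/bin/bash
#SBATCH -J wrf_bench
#SBATCH -n 64                  # total MPI processes
#SBATCH --constraint="2016"    # restrict the job to one hardware year (placeholder)
#SBATCH -t 02:00:00

module load gnu openmpi        # one of the six compiler/MPI combinations (placeholder module names)
cd $SLURM_SUBMIT_DIR
# nproc_x * nproc_y in namelist.input must match the number of MPI processes
srun ./wrf.exe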

Specific results can be found in the subdirectories. For example, the results for GNU-OpenMPI on the 2010 hardware with 4 processors are located in the subdirectory:

GNU_OPENMPI/gnumpi_2010_4/

Specific timing information will be found in the file rsl.error.0000 in each of the subdirectories. To find the GFlops, we use the stats.awk program provided by WRF, using the command

grep 'Timing for main' rsl.error.0000 | tail -149 | awk '{print $9}' | awk -f stats.awk

In practice, this command is wrapped in the Python script used to calculate the GFlops for all configurations; a MATLAB script is then used to plot the results.
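For reference, a rough shell sketch of the equivalent loop is shown below; the directory glob follows the layout described above, and stats.awk is assumed to be in the current directory.

# Collect timing statistics for every GNU-OpenMPI result directory
# (the glob pattern and stats.awk location are illustrative)
for d in GNU_OPENMPI/gnumpi_*_*/; do
    echo "$d"
    grep 'Timing for main' "$d/rsl.error.0000" | tail -149 | awk '{print $9}' | awk -f stats.awk
done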

WRF Availability for Users

All of the post-processing tools described above are available to RCC users in the following directory:

/gpfs/research/software/benchmarks/benchmarks.old/tools

They are also available by cloning our git repository, described at the end of this section. Additionally, there should be no need to reconfigure or recompile WRF when benchmarking on the HPC with the six compiler combinations described above, as these builds are already available in the directory

/gpfs/research/software/benchmarks/benchmarks.old/benchmarks

To use these tools, some slight modifications are required so that the output from the WRF benchmark tests and post-processing is written to the user's home directory; these changes are outlined after each tool is explained.

The submit script submitJob.sh creates a new folder, adds all the necessary symbolic links, then creates and submits a MOAB script for the job. The script takes three required command-line arguments (the compiler combination, the year of the machines being tested, and the number of processors requested) plus one optional argument: the estimated time for the job to complete, which defaults to 02:00:00 (the amount of time required for a 4-processor job) and can be reduced if more processors are used. It may be useful to run head submitJob.sh, since all of these parameters are briefly explained in the first few lines of the script. After the job completes, the output folder will contain two large files that are identical regardless of the job configuration and are unnecessary to keep: wrfout_d01_2001-10-25_00_00_00 and wrfout_d01_2001-10-25_03_00_00. They should be deleted to save space.
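A hypothetical invocation is shown below; the exact argument spelling should be checked with head submitJob.sh as noted above, and the values here are examples only.

# compiler combination, hardware year, processor count, and (optionally) wall time
./submitJob.sh GNU_OPENMPI 2010 16 01:00:00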

The other two scripts were used primarily to generate the figures shown below, though some users may find them useful. The scanResults.py script crawls through all of the simulation results and extracts the timing information, making use of the calcGF.sh and stats.awk files (which are not meant to be used directly). It outputs a file containing the compiler combination, the year, the number of processors, the average time per simulation time step, and the speed in GFlops for each job configuration. The generateFigures.m MATLAB script then uses this timing information to plot the data.
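A possible post-processing sequence is sketched below; neither script's arguments are documented here, so both calls are illustrative.

# Crawl the results and write the timing summary, then plot it
python scanResults.py
matlab -nodisplay -nosplash -r "generateFigures; exit"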

Here we show how to modify each of these scripts to output data to a user's home directory. Note that each script must be copied to the user's home directory and may need to be given execute permissions. The following folder layout may not be ideal for all users, but it illustrates which folder paths need to be changed in each script; the setup commands are collected into a single sketch after the list.

  • Create directory: $HOME/WRF
  • Create subdirectories: $HOME/WRF/figs $HOME/WRF/tools $HOME/WRF/output
  • Copy the tools from the public benchmark directory to your private directory: cp /gpfs/research/software/benchmarks/benchmarks.old/tools/* $HOME/WRF/tools/
  • Edit submitJob.sh: Change the userOutDir variable to point to the output subdirectory in the WRF directory above ($HOME/WRF/output). The queue may also need to be changed; the default is backfill.
  • Edit calcGF.sh: Change publishedResults variable to point to the output subdirectory.
  • Edit scanResults.py: The compilers, years, and processors arrays may need to be changed to reflect your suite of test configurations. No directory variables need to be modified for this file.
  • Edit generateFigures.m: Change the figureFolder variable to point to the figs subdirectory. The compilerNames, titleNames, fileNames, and years arrays may also need to be changed.
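Collected into one place, the directory setup from the list above looks like the following; the script edits still have to be made by hand.

mkdir -p $HOME/WRF/figs $HOME/WRF/tools $HOME/WRF/output
cp /gpfs/research/software/benchmarks/benchmarks.old/tools/* $HOME/WRF/tools/
# Then edit submitJob.sh, calcGF.sh, scanResults.py, and generateFigures.m
# (userOutDir, publishedResults, figureFolder, and the configuration arrays) as described above.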

To test any changes, a recommended configuration for the submit script is GNU OpenMPI on the 2010 machines with 4 or 16 processors. These jobs should start and complete relatively quickly, even if the backfill or general access queue is chosen.

Finally, these tools can be found on the RCC BitBucket page:

https://bitbucket.org/fsurcc/wrf-benchmarks

The specific configuration and compilation steps used to set up WRF are available there if the need arises, along with some additional information on how to set up and run the tools described above. The repository can be cloned with the command:

git clone https://bitbucket.org/fsurcc/wrf-benchmarks

The same changes to the scripts described above will need to be made.

Benchmark Results By Year

The following graphs summarize the performance of WRF on the different HPC hardware generations, identified by the year in which they were brought into service (for example, SLURM can be instructed to use only the nodes purchased in 2019 with the option --constraint="2019"). The y-axis shows the performance in GFlops (1 GFlops is one billion floating-point operations per second); the higher this number, the better the performance. The graph below shows the benchmark results for low core counts. These results are averaged over all six compiler-MPI combinations (GNU, Intel, and PGI, each paired with OpenMPI and MVAPICH2) per data point and therefore show the overall performance of the different hardware types.
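For example, a job can be restricted to a single hardware generation at submission time (the job script name here is a placeholder):

# Run only on nodes brought into service in 2019
sbatch --constraint="2019" myjob.sh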

Benchmark Results By Compiler

The following is a breakdown of the above graph into separate compiler combinations. Note that we are missing PGI compiler benchmarks for the 2019 nodes. In summary, WRF performs best with the Intel compilers (roughly a 2x performance gain over the GNU compilers), with the PGI compilers falling in between. OpenMPI also appears to perform slightly better than MVAPICH2 with all compilers.

GNU-OpenMPI

 

GNU-mvapich2

 

Intel-OpenMPI

Intel-mvapich2

PGI-OpenMPI

PGI-mvapich2

LAMMPS

LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a molecular dynamics simulation package. We chose it as a benchmark application for two reasons: it is widely used in the FSU research community, and its performance behaves differently from WRF's depending on the type of model used. We used the following four benchmarks, provided as part of the LAMMPS package, to measure the performance of our systems.

  1. LJ = atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff (55 neighbors per atom), NVE integration
  2. Chain = bead-spring polymer melt of 100-mer chains, FENE bonds and LJ pairwise interactions with a 2^(1/6) sigma cutoff (5 neighbors per atom), NVE integration
  3. EAM = metallic solid, Cu EAM potential with 4.95 Angstrom cutoff (45 neighbors per atom), NVE integration
  4. Rhodo = rhodopsin protein in solvated lipid bilayer, CHARMM force field with a 10 Angstrom LJ cutoff (440 neighbors per atom), particle-particle particle-mesh (PPPM) for long-range Coulombics, NPT integration

We installed LAMMPS compiled with every compiler-MPI combination we provide, and users can access these versions simply by loading the desired module (e.g., the intel-mvapich2 module gives access to the Intel-MVAPICH2 build of LAMMPS).
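As a sketch, running the LJ benchmark with one of these builds might look like the following; the LAMMPS binary name and the location of the bundled benchmark input files vary between installations, so both are placeholders.

# Load one compiler/MPI combination and run the bundled LJ benchmark input
module load intel-mvapich2
srun lmp -in in.lj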

LJ Benchmarks

The following graph summarizes the LJ benchmark results over all compilers. It is noticeable that the 2019 nodes perform significantly better (~40%) than the other nodes, which is a direct result of better hardware. Notice that WRF only had a slight performance gain on the same nodes; the conclusion is that the exact gain depends on the type of job. As shown below, even different types of LAMMPS jobs scale differently.

The following graphs show performance of individual compilers.

GNU-OpenMPI

GNU-mvapich2

Intel-OpenMPI

Intel-mvapich2

Chain Benchmarks

EAM Benchmarks

Rhodo Benchmarks

Summary

As you can see, the exact performance varies with the compiler and hardware as well as the type of job. Therefore, the results shown here should only be used as a rule of thumb when comparing the performance of your own jobs against them.

Our tests also show that older hardware is not necessarily slower (for example, the 2014 nodes perform better than some newer nodes when running LAMMPS!). We are planning to add a third benchmark so that our users can get a more complete picture of our hardware capabilities.

We encourage you to use these results as a guide when assessing the performance of your own jobs (even if you do not run WRF or LAMMPS). Almost all of these tests were run during annual maintenance downtime or as soon as new sets of hardware were built, so interference from other jobs running on the cluster was minimized. Please understand that this will not be the case when you run a job on the cluster.
