Submitting HPC Jobs

NOTE: In July 2015, we migrated to the Slurm scheduler. If you need help updating your MOAB/Torque scripts for Slurm, refer to our MOAB to Slurm Migration Guide.

This page describes how to submit a job to the High Performance Computing Cluster.

Overview

Any time you wish to use the HPC, you must create a "job" and submit it to one of our processing partitions. A partition is a subset of our overall compute cluster that can run jobs. We use scheduling software named Slurm to manage jobs and partitions. This software lets us share limited compute resources across a large campus community. Because those resources are shared, your job may wait in the queue for some time before it starts.

We provide many partitions for different groups of users. Any faculty or student with an RCC Account can submit jobs to our general access, backfill, quicktest, and condor partitions. For those research groups that have purchased resources on the HPC, we provide dedicated partitions with enhanced access to resources in the HPC cluster.

Getting Ready to Submit

Before submitting your job to the Slurm job scheduler, you must decide a few things:

  1. Which partition to submit your job to - To see a list of partitions you have access to, run rcctool my:partitions. If you have purchased HPC access, you will have your own dedicated partition(s). All other users can choose among the general access partitions:
    • genacc_q is good for MPI jobs that will run for up to two weeks.
    • backfill and backfill2 are better for shorter-running MPI jobs (up to four hours) that may use more resources at a time.
    • condor is good for very long-running jobs (up to 90 days) that do not need to use MPI.
    • quicktest is good for testing your code. The maximum execution time is 10 minutes, but jobs start almost immediately.
  2. How long your job needs to run - Choose this value carefully. If the value is too short, the system will kill your job before it completes. If it is too long, the scheduler may delay the start of your job longer than necessary. This value cannot exceed the maximum run time allowed in the partition you choose.
  3. How many compute nodes your job will need - This is the level of parallelization, and is also limited per user and per partition. If your job does not contain any parallelization code, or only runs on a single node, you may consider using our Spear or Condor systems to run your job.
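The run-time limits listed above can be turned into a rough first guess at a partition. The following is a minimal sketch, not an official tool: the function name suggest_partition and its two arguments (estimated minutes, and "yes" or "no" for MPI) are hypothetical, and the limits are simply the ones quoted in the list. Dedicated partitions and per-user node limits are not considered.

```shell
#!/bin/sh
# Hypothetical helper: suggest a general access partition from an estimated
# run time in minutes ($1) and whether the job uses MPI ($2: "yes" or "no").
# The limits below are copied from the partition list in this guide.
suggest_partition() {
    minutes=$1
    uses_mpi=$2

    if [ "$minutes" -le 10 ]; then
        echo "quicktest"      # up to 10 minutes, intended for testing
    elif [ "$minutes" -le 240 ]; then
        echo "backfill"       # up to 4 hours
    elif [ "$minutes" -le 20160 ]; then
        echo "genacc_q"       # up to two weeks (14 * 24 * 60 minutes)
    elif [ "$uses_mpi" = "no" ] && [ "$minutes" -le 129600 ]; then
        echo "condor"         # up to 90 days, non-MPI jobs only
    else
        echo "none"           # no general access partition fits
    fi
}

suggest_partition 5 yes        # short test run
suggest_partition 4320 yes     # three-day MPI job
suggest_partition 43200 no     # thirty-day non-MPI job
```

Treat the result only as a starting point; rcctool my:partitions remains the way to see which partitions you can actually use.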

Compiling Code to Run on the HPC

Jobs typically consist of compiled C, C++, or Fortran code, Java programs, or scripts written in Python, Bash, csh, or tcsh.

In this example, you will compile a simple C program, submit it to the HPC genacc_q queue, and view the output. Start by logging into the HPC via SSH. Copy our example primes.c program to your home directory:

$ cp /panfs/storage.local/opt/examples/primes.c ~

If you run cat primes.c, you will see that this example is a simple C program for calculating the first 100,000 prime numbers. You can change the number calculated by editing the source code.

If you have your own source code you wish to use, refer to our guide for copying it into your HPC home directory.

To compile this into an executable, you can use the GCC compiler:

$ gcc -o countprimes primes.c

This will create a file in your home directory named countprimes, which you can run:

$ ./countprimes

You can stop the program by pressing CTRL+C. Note that when you run the program this way, directly on the login node, you are not actually using the HPC. To use the HPC, you must submit the job to a queue.

Submitting a Job to the HPC

Submitting your job consists of two steps:

  1. Create a submission script
  2. Run the sbatch command to submit your job

A submission script is simply a text file that contains your job parameters and the commands you wish to execute as part of your job. You can also load modules, set environment variables, or perform other tasks inside your submission script.

For the countprimes example from above, you can use a text editor to create a submission script file named submit_count_primes.sh in your home directory with the following parameters:

#!/bin/bash

#SBATCH --job-name="letscountprimes"
#SBATCH -n 1
#SBATCH -p genacc_q
#SBATCH --mail-type="ALL"
#SBATCH -t 00:15:00

~/countprimes

The parameters in this file specify the following, in order:

  1. The name of your job is letscountprimes
  2. The number of cores this job will require is one.
  3. Submit this to the genacc_q partition
  4. Send an email to you when the job starts, when it completes, and if it aborts for any reason.
  5. You expect this job will take approximately, but no longer than, 15 minutes to run.
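Slurm reads a three-field -t value as hours:minutes:seconds, and it also accepts a days-hours:minutes:seconds form for longer jobs, so it is easy to request seconds when you meant minutes. The sketch below converts a run time in whole minutes into one of those two forms. The function name to_slurm_time is hypothetical and is not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper: convert whole minutes ($1) into a Slurm -t value.
# Prints HH:MM:SS for jobs under one day, D-HH:MM:SS otherwise.
to_slurm_time() {
    total=$1
    days=$((total / 1440))             # 1440 minutes per day
    hours=$(((total % 1440) / 60))
    mins=$((total % 60))

    if [ "$days" -gt 0 ]; then
        printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
    else
        printf '%02d:%02d:00\n' "$hours" "$mins"
    fi
}

to_slurm_time 15       # fifteen minutes
to_slurm_time 240      # four hours
to_slurm_time 20160    # two weeks
```

The printed value can be pasted after #SBATCH -t in a submission script.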

For a full list of Slurm parameters that are available when you submit jobs, refer to our reference guide.

The "shebang" line (#!/bin/bash) is always required, since this is a shell script. Any lines that come after the #SBATCH parameters are commands to run as part of the job. In this example, there is only one command to run: ~/countprimes. Note that you must include the path to your binary executable on the last line if it is in your home directory ("~" is shorthand for your home directory). For system-wide tools such as Gaussian or Java, you may not need to include the path to the executable, but it is generally a good idea.

Save this script, and make it executable by running chmod:

$ chmod +x submit_count_primes.sh

Now, if you wish to submit this job, you can simply run:

$ sbatch submit_count_primes.sh

When you run this command, sbatch prints a short confirmation message that includes a number. This is your job ID. Don't worry if you forget it; you can look it up again later. You should also receive an email indicating your job has been queued and is in pending status.
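If you want to keep the job ID for later commands, you can capture it in a shell variable. On a typical Slurm installation, sbatch prints a line of the form "Submitted batch job 12345", so the ID is the last word of that line. The sketch below assumes that message format. The function name extract_job_id is hypothetical, and the sample message is illustrative text rather than real output from our cluster.

```shell
#!/bin/sh
# Hypothetical helper: pull the job ID out of sbatch's confirmation message.
# Assumes the message ends with the job ID, as in "Submitted batch job 12345".
extract_job_id() {
    # Print only the last whitespace-separated field of the message.
    printf '%s\n' "$1" | awk '{print $NF}'
}

# On the cluster you would capture the real message instead, for example:
#   message=$(sbatch submit_count_primes.sh)
message="Submitted batch job 12345"    # sample text for illustration only

job_id=$(extract_job_id "$message")
echo "$job_id"
```

Alternatively, sbatch --parsable prints the job ID without the surrounding text (followed by a semicolon and the cluster name on multi-cluster setups), which avoids the parsing step.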

Now that you've submitted your job, it is in our genacc_q partition and will run as soon as resources become available. This can take anywhere from a few seconds to several days, depending on how many resources and how much time your job requests. You will receive an email when the job begins to run.

You can see if your job has started or not by running squeue -j JOB_ID. If you can't remember your job ID, you can run squeue -u USERNAME to see your queued and running jobs.
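squeue reports each job's status as a short code in its ST column, for example PD for pending and R for running. The sketch below translates the most common codes into words. The function names describe_state and job_state_code are hypothetical. The second function shows how the code could be fetched with squeue's -h (no header) and -o %t (compact state) options, but it is only defined here, not called, since it needs a live cluster.

```shell
#!/bin/sh
# Hypothetical helper: translate a compact Slurm job state code into words.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        F)  echo "failed" ;;
        TO) echo "timed out" ;;
        CA) echo "cancelled" ;;
        "") echo "no longer listed by squeue" ;;
        *)  echo "other ($1)" ;;
    esac
}

# Hypothetical wrapper: print only the state code for job ID $1.
# Defined for reference; calling it requires access to the cluster.
job_state_code() {
    squeue -h -j "$1" -o %t
}

describe_state PD
describe_state R
```

On the cluster, the two pieces could be combined as describe_state "$(job_state_code JOB_ID)".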

For more information about job management, refer to our reference guide.

Viewing Job Output

Eventually, your job will run and either complete or fail. When your job completes or fails, you will receive an email. Either way, an output file will appear in the directory from which you submitted the job (your home directory in this example). The file is named slurm-[JOB_ID].out.

The slurm-[JOB_ID].out file contains the output from your job. In this example, this will be a list of prime numbers. It will also contain any error output that the program generated.

Slurm allows you to separate your normal output from your error output if your workflow requires it. Simply add the following line to your submission script:

#SBATCH -e [SOME_FILE_NAME]
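File names given to -e (and to -o, its counterpart for normal output) may contain the pattern %j, which Slurm replaces with the job ID. The default output name slurm-[JOB_ID].out corresponds to the pattern slurm-%j.out. The sketch below mimics that substitution so you can predict which files a job will create. The function name expand_slurm_pattern and the file name countprimes-%j.err are hypothetical examples, and only %j is handled.

```shell
#!/bin/sh
# Hypothetical helper: show the file name Slurm would produce from a
# pattern ($1) for a given job ID ($2). Only the %j pattern is handled.
expand_slurm_pattern() {
    printf '%s\n' "$1" | sed "s/%j/$2/g"
}

expand_slurm_pattern 'slurm-%j.out' 12345          # default output file
expand_slurm_pattern 'countprimes-%j.err' 12345    # e.g. #SBATCH -e countprimes-%j.err
```

Including %j in the error file name keeps the error output of different jobs from overwriting each other.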

More Information

In this documentation page, we have covered the basics of submitting jobs to Slurm. For more options and information, refer to our reference guide.