Submitting HPC Jobs

This page describes how to submit a job to the High Performance Computing Cluster.

Overview

Any time you wish to use the HPC, you must create a "job" and submit it to one of our processing partitions. A partition represents a subset of our overall compute cluster that can run jobs. We use advanced scheduling software called Slurm to manage jobs and partitions. This software enables us to provide limited compute resources to a large campus community.

Your job may not start right away once you have submitted it. Jobs are typically queued and processed according to an algorithm that ensures fair access for all users. Your job will remain queued anywhere from a few seconds to a few hours, depending on how many resources it requires and how much other activity is occurring on the cluster. The average wait time is a few minutes.

We provide many partitions for different groups of users. Any faculty member or student with an RCC account can submit jobs to our general access, backfill, backfill2, quicktest, and condor partitions. For research groups that have purchased resources on the HPC, we provide dedicated partitions with enhanced access to the cluster.

Getting Ready to Submit

Before submitting your job to the Slurm Job Scheduler, you must decide on a few things:

  1. How long your job needs to run - Choose this value carefully. If you underestimate it, the system will kill your job before it completes. If you overestimate it, your job may wait in the queue longer than necessary. Generally, it is better to overestimate than to underestimate.
  2. How many compute cores and nodes your job will need - This is the level of parallelization. If your program supports running multiple threads or processes, you will need to determine how many to allocate.
  3. How much memory your job will need - By default, 3.9 GB of memory is allocated for each compute core you request (except in backfill and backfill2, where the value is 1.9 GB per core). This is enough for most jobs, but you can increase it if your job requires more memory.
  4. If your job needs access to special features - Your job may need access to GPUs or to nodes with specific processor models. You can specify these as constraints when you submit your job (see the example directives after this list).
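
The sketch below shows how these decisions might translate into Slurm directives in a submission script. The option names are standard Slurm parameters, but the values and the [FEATURE_NAME] placeholder are only illustrative; the limits and features that actually apply depend on the partition you choose.

# 1. Maximum run time (hours:minutes:seconds)
#SBATCH -t 08:00:00

# 2. Number of tasks (compute cores) and, optionally, nodes
#SBATCH -n 8
#SBATCH -N 2

# 3. Memory per core, overriding the default
#SBATCH --mem-per-cpu=8G

# 4. Special features: a GPU, or nodes with a specific feature tag
#SBATCH --gres=gpu:1
#SBATCH --constraint=[FEATURE_NAME]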

Choose a Partition

You must submit your job to a partition.  Each partition has different limitations and parameters.  If you have purchased resources on the HPC, you will have access to a dedicated partition.  All users have access to our general access partitions:

  • genacc_q is good for MPI jobs that will run for up to two weeks.
  • backfill and backfill2 are better for short-running MPI jobs (up to four hours).
    • backfill has access to fewer resources, but your job is guaranteed to run for up to four hours.
    • backfill2 has access to the entire cluster, but your job may be preempted by other users at any time.
  • condor is good for very long running jobs (up to 90 days) that do not need to use MPI.
  • quicktest is good for testing your code. The maximum execution time is 10 minutes, but jobs always start within seconds of being submitted.

You can see a list of partitions along with their configuration on our website or by running the following command on the HPC:

$ rcctool my:partitions

Submitting a Job to the HPC

Submitting your job consists of two steps:

  1. Create a submission script
  2. Run the sbatch command to submit your job

A submission script is simply a text file that contains your job parameters and the commands you wish to execute as part of your job. You can also load modules, set environment variables, and perform other setup tasks inside your submission script.
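
For instance, a minimal submission script that sets an environment variable before running a multi-threaded program might look like the following sketch; the ~/my_threaded_program path is a hypothetical placeholder, not a real file on the cluster:

#!/bin/bash

#SBATCH -p genacc_q
#SBATCH -n 1
#SBATCH --cpus-per-task=4
#SBATCH -t 00:30:00

# Set an environment variable that the program reads at runtime
export OMP_NUM_THREADS=4

# Run a hypothetical program from your home directory
~/my_threaded_program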

Example: Submitting trap-mpivh2

This example demonstrates how to submit the trap-mpivh2 application from our compilation guide.  To follow along, complete the steps in that guide and then return here to submit your job.

You can use a text editor (such as nano or vi) to create a submission script file named submit_trap.sh in your home directory with the following contents:

#!/bin/bash

#SBATCH --job-name="trap"
#SBATCH -n 5
#SBATCH -p genacc_q
#SBATCH --mail-type="ALL"
#SBATCH -t 00:15:00

module load intel
module load intel-mvapich2

~/trap-mpichv2

The parameters in this file specify the following, in order:

  1. The name of your job is trap
  2. This job will use five CPU cores
  3. This job will run in the genacc_q partition
  4. The job scheduler will send you an email when the job starts and finishes, or if it aborts for any reason
  5. You expect this job will take approximately, but no longer than, 15 minutes to run (the -t value is specified as hours:minutes:seconds). The scheduler will abort the job after 15 minutes.
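
The same parameters can also be passed directly on the sbatch command line, where they generally take precedence over the #SBATCH lines inside the script. For example, the following is roughly equivalent to the directives above:

$ sbatch --job-name="trap" -n 5 -p genacc_q --mail-type="ALL" -t 00:15:00 submit_trap.sh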

The "shebang" line: #!/bin/bash line is always required, since this is a shell script.  Each line that begins with #SBATCH are parameters to be passed to the Slurm scheduler.  For a full list of Slurm parameters that are available when you submit jobs, refer to our reference guide, or run the following on the HPC:

$ man sbatch

Any lines below the #SBATCH parameters are commands to run as part of the job. In this example, we do two things:

  1. We load the intel and intel-mvapich2 modules to make the Intel and MVAPICH2 libraries available during the job runtime.
  2. We run the ~/trap-mpichv2 command. This will run on the compute nodes that have been allocated by the Slurm scheduler.

Note that you must include the path to your binary executable ("~" is shorthand for your home directory). For system-wide tools such as Gaussian or Java, you may not need to include the path to the executable, but it is generally a good idea to do so. You can use the which command to determine the path to any system-wide tool; e.g.:

$ which java

Save this script and make it executable by running chmod:

$ chmod +x submit_trap.sh

Now, if you wish to submit this job, you can simply run:

$ sbatch submit_trap.sh

When you run this command, a number will appear. This is your job ID. Don't worry if you forget it; you can look it up again later. You should also receive an email indicating your job has been queued and is in pending status.
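
The output of sbatch typically looks like the following (the job ID shown here is just an example):

Submitted batch job 123456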

Your job is now queued in our genacc_q partition and will run as soon as resources become available. This can take anywhere from a few seconds to several days, depending on how many resources and how much time your job requests. You will receive an email when the job starts.

You can see if your job has started or not by running squeue:

# Use the Job ID to get your job status: e.g.
$ squeue -j 12345

# If you can't remember your JOB ID, you can run this:
$ squeue -u `whoami`

If you do not see your job listed, it has either finished, been cancelled, or failed.
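
If Slurm accounting is enabled on the cluster, you can also check the final state of a job that is no longer in the queue with the sacct command; for example:

$ sacct -j 12345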

For more information about job management, refer to our reference guide.

Viewing Job Output

Eventually, your job will run and either complete or fail. Either way, you will receive an email, and an output file will appear in your home directory. The file name will be slurm-[JOB_ID] followed by the extension .out. You can view this file with common Linux tools, such as more, less, or cat, or with editors such as vim or nano.

$ more slurm-123456.out

The slurm-[JOB_ID].out file contains the output from your job. In this example, this will be the results of the trap application. It will also contain any error output that the program generated.

By default, Slurm sends error and standard output to the same file.  In other words, it combines the STDOUT and STDERR streams into a single output.  You can separate your standard output from your error output if your workflow requires it. Simply add the following line to your submission script:

#SBATCH -e [SOME_FILE_NAME]
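
For example, the following pair of directives (a sketch; the file names are arbitrary) sends standard output and standard error to separate files. The %j pattern is replaced by the job ID:

#SBATCH -o trap_%j.out
#SBATCH -e trap_%j.err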

More Information

In this documentation page, we have covered the basics of submitting jobs to Slurm. For more options and information, refer to our reference guide.