In 2015, we migrated from the MOAB/Torque scheduler to the Slurm scheduler. Below is a copy of our guide, preserved for archival purposes.
In June of 2015, the RCC will replace our MOAB/Torque scheduler software with another product, Slurm. Refer to our news announcement for more information about the reason for the switch. The new scheduler will necessitate major changes to your submission scripts and job management commands. This guide provides most of the essential information needed to make the change.
'Queues' are now 'Partitions'
Slurm operates very similarly to MOAB/Torque. It is important, however, to understand a few key differences between the two platforms. Most importantly, Slurm refers to queues as partitions. Partitions operate the same way that queues did, and you will notice little difference when using them. Instead of submitting a job to a queue, you will simply submit it to a partition.
All of the queues that existed in MOAB (e.g. genacc_q, backfill_q, etc.) will exist as partitions in Slurm with the same names, parameters, and resource allocations. The only exception is that we are combining the backfill and backfill2 queues into a single partition. Previously, these two MOAB queues each allowed you to submit jobs to up to half of the nodes in our system. The new backfill Slurm partition will allow you to submit jobs to the entire cluster.
One new feature is the ability to specify multiple partitions when you submit a job. If you do this, Slurm will run your job on the first partition where resources become available with your specified parameters. Simply separate the partition names with commas when you submit your job, or in your submission script. For example:
$ sbatch -p genacc_q,backfill my_script.sh   # comma separated, no spaces between partition names
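The same multi-partition request can also be placed in the submission script itself. Below is a minimal sketch of the relevant directive, assuming the same genacc_q and backfill partition names used in the example above:

#!/bin/bash
#SBATCH -p genacc_q,backfill
# ...the rest of your #SBATCH directives and commands follow as usual...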
Using the Slurm Test Cluster
We have provided a test cluster for your use. This will allow you to update and test your job submission scripts, and practice using job management commands to manage your HPC jobs. You can log in to the test cluster at the following address:
This test cluster consists of one login node and ten compute nodes. All the basic scheduler functionality exists, and it mounts both Panasas and Lustre.
We have developed a "cheat sheet" to assist in the transition from MOAB to Slurm. Please use this as a reference when updating your scripts and running commands.
Updating Submission Scripts
You will need to update most of the parameters in your job submission scripts to use the new Slurm syntax. As an example, consider this MOAB script:
#!/bin/bash
#MOAB -N "letscountprimes"
#MOAB -l nodes=1
#MOAB -j oe
#MOAB -q genacc_q
#MOAB -m abe
#MOAB -l walltime=00:00:15
$PBS_O_WORKDIR/countprimes
The corresponding Slurm script will look like this:
#!/bin/bash
#SBATCH -J "letscountprimes"
#SBATCH -n 1
#SBATCH -o letscountprimes.o
#SBATCH -p genacc_q
#SBATCH --mail-type=ALL
#SBATCH -t 15:00
$SLURM_SUBMIT_DIR/countprimes
Notice that the #MOAB directives are now #SBATCH directives, and that $PBS_O_WORKDIR is now $SLURM_SUBMIT_DIR. Parameters can have many options, and they are documented in the official Slurm documentation.
List of MOAB Parameters and Slurm Equivalents
The following table lists common MOAB parameters in submit scripts and their corresponding Slurm equivalents. Looking for something that's not listed here? Refer to the Slurm Rosetta Stone and the parameter list in the Slurm documentation.
|Parameter||MOAB/Torque||Slurm|
|Job Array Index||$PBS_ARRAYID||$SLURM_ARRAY_TASK_ID|
|Queue||-q [queue_name]||-p [partition_name]|
|Node Count||-l nodes=[count]||-N [min[-max]]|
|CPU Count||-l ppn=[count]||--ntasks-per-node=[count]|
|Wall Clock Limit||-l walltime=[hh:mm:ss]||-t [days-hh:mm:ss]|
|Standard Output File||-o [file_name]||-o [file_name]|
|Standard Error File||-e [file_name]||-e [file_name]|
|Combine Error/Output File||-j oe||(use -o without -e)|
|Copy Environment||-V||--export=[ALL | NONE | vars...]|
|Event Notification (email)||-m abe||--mail-type=[ALL | events...]|
|Custom Email Address||-M [address]||--mail-user=[address]|
|Job Name||-N [name]||-J [name]|
|Job Restart||-r [y|n]||--requeue or --no-requeue|
|Memory Size||-l mem=[MB]||--mem=[mem][M|G|T]|
|Memory Size per CPU||-l pmem=[MB]||--mem-per-cpu=[mem][M|G|T]|
|Job Dependency||-d [job_id]||--depend=[state:job_id]|
|Begin Time||-A "YYYY-MM-DD HH:MM:SS"||--begin=YYYY-MM-DD[THH:MM[:SS]]|
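As an illustration, a submission script that combines several of the Slurm parameters from the table above might look like the following sketch. The job name, partition, resource values, email address, and program name are placeholders, not recommendations for any particular workload:

#!/bin/bash
#SBATCH -J "example_job"              # job name (-N in MOAB)
#SBATCH -p genacc_q                   # partition (queue in MOAB)
#SBATCH -N 1                          # node count
#SBATCH --ntasks-per-node=4           # tasks per node (ppn in MOAB)
#SBATCH --mem-per-cpu=500M            # memory per CPU (pmem in MOAB)
#SBATCH -t 01:00:00                   # wall clock limit (walltime in MOAB)
#SBATCH -o example_job.o              # standard output file
#SBATCH --mail-type=ALL               # email on all job events (-m abe in MOAB)
#SBATCH --mail-user=user@example.com  # custom email address (-M in MOAB)

# Run a program from the directory the job was submitted from
# (replace my_program with your own executable)
$SLURM_SUBMIT_DIR/my_program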
Submitting and Managing Jobs
The commands that you use to submit and manage jobs on the HPC are different for Slurm than they were for MOAB. To submit jobs, you will now use the sbatch or srun commands. To check job status, you will most commonly use the squeue command.
For example, you may wish to submit a job and then check its status:
$ # Submit a job
$ sbatch myjob_submit.sh
$ # Check job status for my running jobs
$ squeue -u cam02h
List of Common MOAB Commands and Slurm Equivalents
The following table lists common MOAB commands and their Slurm equivalents. For a comprehensive list, refer to the Slurm documentation.
|Task||MOAB/Torque||Slurm|
|Submit a job||msub [script_file]||sbatch [script_file]|
|Cancel a job||canceljob [job_id]||scancel [job_id]|
|Cancel pending jobs for user||canceljob ALL||scancel -u [user_name] -t PENDING|
|Delete a job||qdel [job_id]||scancel [job_id]|
|Show job status (by job)||qstat [job_id]||squeue -j [job_id]|
|Show job status (by owner)||showq -u [user_name]||squeue -u [user_name]|
|Show queue/partition status||showq -q [queue_name]||squeue -p [partition_name]|
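Putting a few of these commands together, a typical session might look like the sketch below. The script names and job IDs are hypothetical, the user name is the same example account used elsewhere in this guide, and the "Submitted batch job" lines are the normal confirmation messages printed by sbatch:

$ # Submit the first job
$ sbatch first_step.sh
Submitted batch job 123456
$ # Submit a second job that waits for the first to finish successfully
$ sbatch --depend=afterok:123456 second_step.sh
Submitted batch job 123457
$ # Check the status of both jobs
$ squeue -u cam02h
$ # Cancel the second job while it is still pending
$ scancel 123457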
You can always get help for any command by using the --help flag, or more detailed information by typing man command_name. For example:
$ scancel --help
Usage: scancel [OPTIONS] [job_id[_array_id][.step_id]]
  -A, --account=account           act only on jobs charging this account
  -b, --batch                     signal batch shell for specified job
  -i, --interactive               require response from user for each job
  -n, --name=job_name             act only on jobs with this name
  -p, --partition=partition       act only on jobs in this partition
  -Q, --quiet                     disable warnings
  -q, --qos=qos                   act only on jobs with this quality of service
  -R, --reservation=reservation   act only on jobs with this reservation
  -s, --signal=name | integer     signal to send to job, default is SIGKILL
  -t, --state=states              act only on jobs in this state.
                                  Valid job states are PENDING, RUNNING and SUSPENDED
  -u, --user=user_name            act only on jobs of this user
  -V, --version                   output version information and exit
  -v, --verbose                   verbosity level
  -w, --nodelist                  act only on jobs on these nodes
  --wckey=wckey                   act only on jobs with this workload characterization key

Help options:
  --help                          show this help message
  --usage                         display brief usage message
List of Slurm Job States
Similar to MOAB, Slurm jobs can have different states. Below is a table listing MOAB job states and their Slurm equivalents:
|Job State||MOAB/Torque||Slurm Code||Slurm Name||Description|
|Pending||idle||PD||PENDING||Job is queued; waiting to run|
|Preempted||deferred||PR||PREEMPTED||Job terminated due to preemption|
|Configuring||starting||CF||CONFIGURING||Job has been allocated resources, but is waiting for them to become ready|
|Running||running||R||RUNNING||Job has allocation and is running.|
|Completing||-no equivalent-||CG||COMPLETING||Job is in the process of completing; some nodes may still be active|
|Completed||completed||CD||COMPLETED||Job has terminated all processes successfully|
|Cancelled||removed||CA||CANCELLED||Job was explicitly cancelled by user or administrator before or during running.|
|Suspended||suspended||S||SUSPENDED||Job has allocation, but execution has been suspended.|
|Timeout||removed||TO||TIMEOUT||Job has reached its time limit (maximum allowable time limit per partition, or time limit specified by user upon submission).|
|Failed||removed||F||FAILED||Job terminated with non-zero exit code or other failure condition.|
|Node Failed||vacated||NF||NODE_FAIL||Job terminated due to failure of one or more allocated nodes.|
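The compact codes in the table are what squeue shows in its ST column by default; the full state names can be requested with squeue's -o format option (%T), or used directly with the -t filter. For example (the user name is just an example):

$ # Default listing; the ST column shows the compact state code
$ squeue -u cam02h
$ # Show the long state name instead, using a custom output format
$ squeue -u cam02h -o "%.10i %.9P %.20j %.10T %.10M"
$ # List only jobs in the PENDING state
$ squeue -u cam02h -t PENDING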
Examples of Common Tasks
List partitions and their resources:
scontrol show partition
Show available resources for General Access partition jobs:
scontrol show partition genacc_q
Show all jobs running in the Backfill partition:
squeue -p backfill
Show all pending jobs in the General Access partition owned by user 'john':
squeue -p genacc_q -u john -t PENDING
Suppose you submitted a job with ID #123456. If the job is not running, you can check its future eligibility:
scontrol show jobid -dd 123456
Cancel job with ID #123456:
scancel 123456
Getting Help and Troubleshooting
We understand that not all transitions will go smoothly. To help, RCC staff will provide one-on-one support for anyone who needs help updating their scripts and workflow. Simply submit a support ticket and let us know that you would like us to help.