Job Management in MOAB

Once you have submitted a job to the MOAB queue, you can monitor, cancel, and troubleshoot that job.

Common MOAB commands

Command                               Explanation
showq                                 Shows all jobs that are queued or running on the MOAB system.
  -r                                  Shows only running jobs, plus additional information such as qos, account, and start time.
  -i                                  Shows only idle jobs.
  -b                                  Shows blocked jobs.
  -c                                  Shows recently completed jobs.
  -u uname                            Shows jobs submitted by user uname.
  -w                                  Shows only jobs that match the specified constraint (a filtered job report). Valid constraints include user, class, group, acct, and qos.
qstat                                 Shows the status of batch jobs.
  [-u user]                           For a specific user.
  [-q queue_name]                     For a specific queue, e.g., qstat -q genacc_q.
  [job_id]                            For a specific job.
checkjob [job_id]                     Checks the status of a job.
  -v                                  Produces detailed output.
  --flags=future                      Evaluates the future eligibility of a job (ignores current resource state and usage limitations).
  -n [node_ID]                        Checks the job's access to the specified node and its preemption status with regard to jobs located on that node.
canceljob [job_id1, job_id2, ...]     Cancels the specified jobs.
  -h                                  Displays usage information.
  ALL                                 Cancels all jobs submitted by a user (use with caution).
mjobctl -c [job_id]                   Cancels a job.
mjobctl -c -w [Attr=Value]            Cancels jobs that match the given job attribute. Attr can be one of [user | account | qos | class | state | jobName].
mjobctl -F [job_id]                   Cancels a job by force (more powerful than mjobctl -c).
mjobctl -h [HOLDTYPE] [job_id]        Places a hold on a job. HOLDTYPE can be one of {user|batch|system|defer|ALL}. Regular users can set only user holds.
mjobctl -u [HOLDTYPE] [job_id]        Releases a job hold (the reverse of mjobctl -h). Note: as a regular user, you cannot use this command to release a hold placed on your job by an administrator or the resource manager.
mjobctl -m ATTR { += | = | -= } VAL   Changes a job's parameters.
qdel [job_id]                         Cancels a job. If the command is not found, use the full path /opt/torque/bin/qdel.
showstart [job_id]                    Attempts to estimate when a queued job will start.*
showbf                                Shows the resources available in a particular queue or for a particular group (see the examples below).

* Start-time estimates are generally unreliable, because most users don't tell the system how long their jobs will run.

Refer to the official MOAB Documentation for a more comprehensive list of job management commands.

Example Usage of MOAB Job Management Commands:

Show the resources available to General Access queue jobs:

showbf -q genacc_q

Show all jobs running in the Backfill queue:

showq -r -w class=backfill

Show all idle (pending) jobs in the General Access queue owned by user john:

showq -i -w class=genacc_q -w user=john

Suppose you submitted a job and its job ID is 7584526. If the job is not running, you can check its future eligibility using:

checkjob --flags=future 7584526

Suppose you submitted a job that requires a specific node, say hpc-12-2, and its job ID is 7584526. You can check the job's access to that node using:

checkjob -n hpc-12-2 7584526

To cancel the job with job ID 7584526:

canceljob 7584526

To cancel all jobs in the state USERHOLD or BATCHHOLD:

mjobctl -c -w state=USERHOLD -w state=BATCHHOLD
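
The hold and release forms of mjobctl listed in the command table can be tried in the same way. The following is only a minimal sketch: the run helper and the DRY_RUN variable are hypothetical names introduced here so that the commands are printed rather than executed by default, and the job ID is the one used in the examples above.

```shell
# Hypothetical helper: print commands instead of running them
# unless DRY_RUN=0 is set in the environment.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

job_id=7584526                  # job ID reused from the examples above

run mjobctl -h user "$job_id"   # place a user hold on the job
run mjobctl -u user "$job_id"   # release the user hold again
```

Setting DRY_RUN=0 runs the two mjobctl commands for real. A regular user can set and release only user holds.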

Job States

You can monitor your jobs using the commands discussed above, such as qstat, showq, and checkjob. To interpret the output of these commands, you need to understand job states. A job's state indicates its current status and its eligibility for execution. The possible job states are listed below:

State      Description
idle       Job is currently queued and eligible to run but is not executing.
hold       Job is idle and is not eligible to run due to a user, administrator, or batch system hold (see the Job Holds section of this page).
deferred   Job is held by Moab due to an inability to schedule the job under current conditions (an intermediate step before the job is placed in batch hold).
starting   Batch system has attempted to start the job and the job is currently performing pre-start tasks (e.g., executing system pre-launch scripts).
running    Job is currently executing the user application.
suspended  Job was running but has been suspended by the scheduler or an administrator; the user application is still in place on the allocated compute resources, but it is not executing.
canceling  Job has been canceled and is in the process of cleaning up.
completed  Job has completed running without failure.
removed    Job has run to its requested walltime successfully but has been canceled by the scheduler or resource manager due to exceeding its walltime or violating another policy; includes jobs canceled by users or administrators either before or after a job has started.
vacated    Job canceled after partial execution due to a system failure.
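
When scripting around these states, for example to decide whether a job is still worth polling, it can help to group them. The function below is a minimal sketch under stated assumptions: the category names (waiting, held, active, paused, ending, finished) are labels invented for this illustration, not MOAB terminology, and the state strings printed by showq or checkjob on your system may be spelled differently from the names in the table.

```shell
# Map a job state from the table above to a coarse category.
# The category names are this sketch's own labels, not MOAB terms.
state_category() {
    # Lower-case the input so "Running" and "running" are treated alike
    state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$state" in
        idle)                       echo "waiting"  ;;
        hold|deferred)              echo "held"     ;;
        starting|running)           echo "active"   ;;
        suspended)                  echo "paused"   ;;
        canceling)                  echo "ending"   ;;
        completed|removed|vacated)  echo "finished" ;;
        *)                          echo "unknown"  ;;
    esac
}

state_category Running
state_category deferred
```

A job in the finished category will not change state again, so a polling loop can stop there.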

Job Eligibility

If a submitted job cannot run, one of the following reasons may be the cause:

Reason                               Description
job has hold in place                One or more job holds are currently in place
insufficient idle procs              There are currently not adequate processor resources available to start the job
idle procs do not meet requirements  Adequate idle processors are available but these do not meet job requirements
start date not reached               Job has specified a minimum start date which is still in the future
expected state is not idle           Job is in an unexpected state
state is not idle                    Job is not in the idle state
dependency is not met                Job depends on another job reaching a certain state
rejected by policy                   Job start is prevented by a throttling policy

If a job cannot run on a specific node, one of the following reasons may be the cause:

Reason    Description
Class     Node does not allow required job class/queue
CPU       Node does not possess required processors
Features  Node does not possess required node features
Memory    Node does not possess required real memory
State     Node is not Idle or Running

Job Holds

A submitted job can be held by users, administrators, or resource managers.

JobHold     Description
UserHold    Hold created by, and under the control of, a non-privileged user; it may be removed at any time by that user.
SystemHold  Hold handled by a system administrator; a regular user cannot release a system hold, even on their own jobs.
BatchHold   Hold placed on a job by the scheduler itself when it determines that the job cannot run.
Deferred    An intermediate hold state before entering the BatchHold state.

FAQs

(1). How many jobs can I submit simultaneously?

It depends on the queue you are using. Refer to HPC_Queues for details.

(2). How do I check the status of my jobs?

You can use:

     showq -u username    # for all your jobs
     qstat                # for all your jobs
     checkjob job_id      # for a specific job

(3). When will my job start?

It depends on the resources available and the amount of resources you have requested. It also depends on your priority (your priority decreases as you submit more jobs). The scheduler can provide a rough estimate:

    showstart job_id

(4). Why is my job not running?

There can be many causes. Your job can remain in the idle or blocked state for a while when adequate resources are not available. It can also be placed on batch hold if it violates a policy, such as a queue policy. See the Job Eligibility and Job Holds sections above for details.

(5). Can I modify my job parameters after submitting my jobs?

Yes. Some parameters can be modified after job submission using the mjobctl -m command. For example, the following command increases the wall clock limit of a job by 4 hours (provided the total is within the limit allowed by the queue you are using):

   mjobctl -m wclimit+=4:00:00 job_id
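
Before extending a wall clock limit, you may want to check that the new total stays within the queue limit. The following is a rough sketch of that arithmetic, not part of MOAB: to_seconds and fits_within_limit are hypothetical helper names, times are assumed to be in plain HH:MM:SS form, and the 20-hour current limit and 24-hour queue limit are made-up values for illustration.

```shell
# Convert an HH:MM:SS duration to seconds (awk treats "08" as decimal 8).
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed if current + extra does not exceed the queue limit.
# Arguments: current wall clock limit, requested extension, queue limit.
fits_within_limit() {
    total=$(( $(to_seconds "$1") + $(to_seconds "$2") ))
    [ "$total" -le "$(to_seconds "$3")" ]
}

# Hypothetical values: 20 h already requested, 4 h extra, 24 h queue limit.
if fits_within_limit 20:00:00 4:00:00 24:00:00; then
    echo "extension fits within the queue limit"
else
    echo "extension would exceed the queue limit"
fi
```

The check only compares durations; the scheduler still decides whether the modification is accepted.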