Once you have submitted a job to the MOAB queue, you can monitor, cancel, and troubleshoot that job.
Common MOAB commands
|showq||Shows all jobs that are queued and running in the MOAB system.|
|-r||shows only running jobs plus additional information such as qos, account, and start time.|
|-i||shows only idle jobs.|
|-b||shows blocked jobs.|
|-c||shows recently completed jobs.|
|-u uname||shows jobs submitted by user uname.|
|-w||shows jobs with specified constraints (or produce
|qstat||Show status of batch jobs.|
|[-u user]||for a specific user|
|[-q queue_name]||e.g., qstat -q genacc_q|
|[job_id]||for a specific job|
|checkjob||Checks the status of a job.|
|-v||flag for detailed output.|
|-n [node_ID]||Checks job access to the specified node and preemption status with regard to jobs located on that node.|
||Cancel the specified job.|
|-h||Display usage information|
|ALL||Cancel all jobs submitted by a user (use with caution).|
||Cancel a job.|
|mjobctl -c -w||Cancel jobs based on a given job attribute. Attributes can be [user | account | qos | class | state | jobName]|
||Cancel a job by force (more powerful than the
||Place a job in hold. The
||unset a job hold (reverse operation to
|mjobctl -m||Change a job's parameters.|
||Cancel job using 'qdel', prepend
||Attempt to provide an estimate of when a queued job will start*|
|showbf||Show available system resources in a particular queue or for a particular group (see examples below).|
* Job estimates are generally unreliable, because most users don't provide the system with an estimate of how long their jobs will run for.
Refer to the official MOAB Documentation for a more comprehensive list of commands for job management.
Example usage of MOAB job management commands:
Show available resources to General Access queue jobs:
showbf -q genacc_q
Show all jobs running in the Backfill queue:
showq -r -w class=backfill
Show all idle (pending) jobs in the General Access queue owned by user john:
showq -i -w class=genacc_q -w user=john
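The filters in these examples combine in a regular way, so a small wrapper can assemble them. The following is a minimal sketch and not part of MOAB: build_showq is a hypothetical helper name, and it only prints the command instead of running it. It uses only the -r, -i, -b, -c and -w options listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of MOAB): prints a showq command, does not run it.
# Usage: build_showq VIEW [CLASS] [USER]
build_showq() {
    view=$1
    class=${2:-}
    user=${3:-}

    # Map a readable view name to the showq flag from the table above.
    case "$view" in
        running)   flag="-r" ;;
        idle)      flag="-i" ;;
        blocked)   flag="-b" ;;
        completed) flag="-c" ;;
        *) echo "unknown view: $view" >&2; return 1 ;;
    esac

    cmd="showq $flag"
    # Each constraint is passed as its own -w option, as in the examples above.
    if [ -n "$class" ]; then cmd="$cmd -w class=$class"; fi
    if [ -n "$user" ]; then cmd="$cmd -w user=$user"; fi
    echo "$cmd"
}
```

For example, build_showq idle genacc_q john prints showq -i -w class=genacc_q -w user=john, which matches the last example above.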
Suppose you submitted a job, and the job id is 7584526. If the job is not running, you can check its future eligibility using
checkjob --flags=future 7584526
Suppose you submitted a job requiring a specific node, say, hpc-12-2, and the job id is 7584526. You can check the job/node status using
checkjob -n hpc-12-2 7584526
To cancel job with job id 7584526
To cancel all jobs in the USERHOLD and BATCHHOLD states:
mjobctl -c -w state=USERHOLD -w state=BATCHHOLD
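Because a bulk cancel is hard to undo, it can help to print the command and review it before running it. The sketch below assumes a hypothetical helper named build_cancel_by_state (not part of MOAB) that adds one -w state=... constraint per argument, in the same form as the command above, and only echoes the result.

```shell
#!/bin/sh
# Hypothetical helper (not part of MOAB): prints an mjobctl cancel command
# with one "-w state=..." constraint per argument. Nothing is cancelled.
build_cancel_by_state() {
    if [ "$#" -eq 0 ]; then
        echo "usage: build_cancel_by_state STATE [STATE...]" >&2
        return 1
    fi
    cmd="mjobctl -c"
    for state in "$@"; do
        cmd="$cmd -w state=$state"
    done
    echo "$cmd"
}
```

Calling build_cancel_by_state USERHOLD BATCHHOLD prints the command shown above, which you can then copy and run once you are satisfied with it.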
You can monitor your jobs using commands discussed above such as checkjob. To better understand the output of these commands, you need to understand the state of your job. The job's state indicates its current status and eligibility for execution. The possible job states are listed below:
|idle||Job is currently queued and eligible to run but is not executing.|
|hold||Job is idle and is not eligible to run due to user, admin, or batch system hold (refer to the
|deferred||Job is held by Moab due to an inability to schedule the job under current conditions (an intermediate step before being put in batch hold).|
|starting||Batch system has attempted to start the job and the job is currently performing pre-start tasks (e.g., executing system pre-launch scripts).|
|running||Job is currently executing the user application.|
|suspended||Job was running but has been suspended by the scheduler or an administrator; user application is still in place on the allocated compute resources, but it is not executing.|
|canceling||Job has been canceled and is in the process of cleaning up.|
|completed||Job has completed running without failure.|
|removed||Job has run to its requested walltime successfully but has been canceled by the scheduler or resource manager due to exceeding its walltime or violating another policy; includes jobs canceled by users or administrators either before or after a job has started.|
|vacated||Job canceled after partial execution due to a system failure.|
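When scripting around these states, it is often enough to know whether a job is still waiting, is occupying compute resources, or is finished. The sketch below groups the states from the table into those three phases. The grouping and the name job_phase are illustrative assumptions based on the descriptions above, not an official MOAB classification.

```shell
#!/bin/sh
# Hypothetical helper: map a job state from the table above to a coarse phase.
# The three phase names are an illustrative grouping, not MOAB terminology.
job_phase() {
    # Accept any letter case, e.g. "Running" or "RUNNING".
    state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$state" in
        idle|hold|deferred)                   echo "waiting" ;;
        starting|running|suspended|canceling) echo "active" ;;
        completed|removed|vacated)            echo "finished" ;;
        *)                                    echo "unknown" ;;
    esac
}
```

Here suspended and canceling count as active because, per the table, the job still occupies its allocated resources or is still cleaning up.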
If a submitted job cannot run, one of the following reasons may be the cause:
|job has hold in place||one or more job holds are currently in place|
|insufficient idle procs||there are currently not adequate processor resources available to start the job|
|idle procs do not meet requirements||adequate idle processors are available but these do not meet job requirements|
|start date not reached||job has specified a minimum start date which is still in the future|
|expected state is not idle||job is in an unexpected state|
|state is not idle||job is not in the idle state|
|dependency is not met||job depends on another job reaching a certain state|
|rejected by policy||job start is prevented by a throttling policy|
If a job cannot run on a specific node, one of the following reasons may be the cause:
|Class||Node does not allow required job class/queue|
|CPU||Node does not possess required processors|
|Features||Node does not possess required node features|
|Memory||Node does not possess required real memory|
|State||Node is not Idle or Running|
A submitted job can be held by users, administrators, or resource managers.
||Holds created by, and under the control of, a non-privileged user; they may be removed at any time by the user.|
||Holds handled by the system administrator; a normal user cannot release a system hold, even on their own jobs.|
||Holds placed on a job by the scheduler itself when it determines that a job cannot run.|
||An intermediate hold state before entering the BatchHold state.|
(1). How many jobs can I submit simultaneously?
It depends on the queue you are using. Refer to HPC_Queues for details.
(2). How do I check the status of my jobs?
You can use
showq -u username    # for all your jobs
qstat                # for all your jobs
checkjob job_id      # for a specific job
(3). When will my job start?
It depends on the resources available and the amount of resources you have requested. It also depends on your priority (your priority will decrease after you have submitted more jobs). A rough estimate can be made by the scheduler using
(4). Why is my job not running?
There can be many causes for this problem. Your job can be put in the idle or blocked state for a while when adequate resources are not available. It can also be put on batchhold when it violates, say, a queue policy. Please check the Job Eligibility and Job Holds sections above for details.
(5). Can I modify my job parameters after submitting my jobs?
Yes. Some parameters can be modified after job submission. The mjobctl -m command allows you to change some job parameters. For example, the following command increases the wall clock limit by 4 hours (provided the total is within the limit allowed by the queue you are using):
mjobctl -m wclimit+=4:00:00 job_id
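Since the increase is added to the existing limit, it can be useful to work out the resulting total before submitting the change. The following sketch does that arithmetic on HH:MM:SS values. The helper names (to_seconds, add_walltime, within_limit) are hypothetical, and any queue limit you pass in is your own input, not a value taken from MOAB.

```shell
#!/bin/sh
# Hypothetical helpers for HH:MM:SS wall clock arithmetic (not part of MOAB).

# Convert HH:MM:SS to a number of seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Add two HH:MM:SS durations and print the total as H:MM:SS.
add_walltime() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, ":"); split(b, y, ":")
        total = (x[1] + y[1]) * 3600 + (x[2] + y[2]) * 60 + x[3] + y[3]
        printf "%d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}

# Succeed if the first duration does not exceed the second.
within_limit() {
    [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}
```

With a current limit of 12:00:00, add_walltime 12:00:00 4:00:00 prints 16:00:00. A check such as within_limit 16:00:00 24:00:00 then succeeds, but the 24:00:00 cap here is a made-up value, so use the actual limit of your queue.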