# Job State Codes

## Active States (visible in squeue)

| Code | State | Description |
|------|-------|-------------|
| `PD` | PENDING | Waiting in queue for resources |
| `R` | RUNNING | Executing on compute node(s) |
| `CG` | COMPLETING | Job finishing, cleaning up processes |
| `S` | SUSPENDED | Job suspended (preempted or by admin) |
| `RH` | REQUEUE_HOLD | Held after being requeued |
| `CF` | CONFIGURING | Nodes being configured (booting, etc.) |

## Final States (visible in sacct)

| Code | State | Description |
|------|-------|-------------|
| `CD` | COMPLETED | Finished successfully (exit code 0) |
| `F` | FAILED | Finished with non-zero exit code |
| `TO` | TIMEOUT | Killed for exceeding time limit |
| `CA` | CANCELLED | Cancelled by user or admin |
| `OOM` | OUT_OF_MEMORY | Killed for exceeding memory limit |
| `NF` | NODE_FAIL | Failed due to node failure |
| `PR` | PREEMPTED | Preempted by higher-priority job |
| `DL` | DEADLINE | Terminated at deadline |
| `RQ` | REQUEUED | Requeued for re-execution |
| `BF` | BOOT_FAIL | Node failed to boot |
| `SE` | SPECIAL_EXIT | Requeued in held state |
| `RV` | REVOKED | Revoked (federated clusters) |

## Exit Codes

Format in sacct: `return_code:signal`

| Example | Meaning |
|---------|---------|
| `0:0` | Success (exit 0, no signal) |
| `1:0` | Application error (exit 1) |
| `0:9` | Killed by SIGKILL (often OOM) |
| `0:15` | Killed by SIGTERM (timeout or scancel) |
| `127:0` | Command not found |
| `137:0` | Exit status 137 reported by the shell (128+9: killed by signal 9, usually OOM or a cgroup kill) |

## Common Pending Reasons

| Reason | Description | What to Do |
|--------|-------------|------------|
| `(Resources)` | Waiting for nodes/CPUs/memory | Wait, or reduce resource request |
| `(Priority)` | Lower priority than other jobs | Wait (fairshare adjusts) |
| `(Dependency)` | Waiting for dependent job | Check parent job |
| `(DependencyNeverSatisfied)` | Dependency can never be met (e.g. parent job failed) | Cancel and resubmit |
| `(QOSMaxJobsPerUserLimit)` | Hit per-user job limit | Wait for jobs to finish |
| `(AssocGrpCpuLimit)` | Account CPU limit reached | Wait or check with admin |
| `(AssocGrpGRES)` | Account GRES limit reached | Wait or check with admin |
| `(ReqNodeNotAvail)` | Nodes unavailable | Check `sinfo` for node states |
| `(PartitionTimeLimit)` | Time exceeds partition max | Reduce `--time` |
| `(InvalidAccount)` | Account doesn't exist | Check `sacctmgr show account` |
| `(InvalidQOS)` | QOS not available | Check allowed QOS |
| `(BeginTime)` | Deferred start (`--begin`) | Wait, or change the start time with `scontrol update` |
| `(JobHeldUser)` | Held by user | `scontrol release <jobid>` |
| `(JobHeldAdmin)` | Held by admin | Contact admin |

## References

- [SchedMD: Job State Codes](https://slurm.schedmd.com/squeue.html#lbAG)
- [SchedMD: Job Reason Codes](https://slurm.schedmd.com/job_reason_codes.html)
- [SchedMD: Job Exit Codes](https://slurm.schedmd.com/job_exit_code.html)
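
## Example: Reading an Exit Code

The `return_code:signal` format described under Exit Codes can be decoded mechanically. The following is a minimal sketch, not part of Slurm: `parse_exit_code`, `describe`, and `SIGNAL_NAMES` are hypothetical helper names, and the wording of each description simply mirrors the examples in the Exit Codes table. It assumes the field has already been obtained as a plain string, for instance from the `ExitCode` column of sacct output.

```python
from typing import NamedTuple

# Signal names taken from the Exit Codes table; anything else falls back
# to a generic "signal N" label.
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


class ExitCode(NamedTuple):
    return_code: int
    signal: int


def parse_exit_code(field: str) -> ExitCode:
    """Split an ExitCode field such as '137:0' into two integers."""
    code, _, sig = field.strip().partition(":")
    return ExitCode(int(code), int(sig or 0))


def describe(field: str) -> str:
    """Return a rough, human-readable reading of an ExitCode field."""
    rc, sig = parse_exit_code(field)
    if sig:
        return f"killed by {SIGNAL_NAMES.get(sig, f'signal {sig}')}"
    if rc == 0:
        return "success"
    if rc == 127:
        return "command not found"
    if rc > 128:
        # Shell convention: 128 + N means the process died from signal N.
        n = rc - 128
        return f"exit {rc}, likely {SIGNAL_NAMES.get(n, f'signal {n}')}"
    return f"application error (exit {rc})"


if __name__ == "__main__":
    for example in ("0:0", "1:0", "0:9", "0:15", "127:0", "137:0"):
        print(f"{example:>6}  {describe(example)}")
```

The `rc > 128` branch is only a heuristic: an application is free to return such a value on its own, so treat the "likely" reading as a hint and confirm against the job state (for example `OUT_OF_MEMORY` or `TIMEOUT`).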