# Node State Codes

## Primary States

| State | Description | Admin Action |
|-------|-------------|--------------|
| `idle` | No jobs, available | None needed |
| `alloc` / `allocated` | Fully allocated to jobs | None needed |
| `mix` / `mixed` | Partially allocated | None needed |
| `drng` / `draining` | Draining (running jobs finish, no new ones start) | Fix issue, then `RESUME` |
| `drain` / `drained` | Drained, all jobs done | Fix issue, then `RESUME` |
| `down` | Unavailable | Investigate, fix, then `RESUME` |
| `down*` | Down and not responding | Check slurmd, network |
| `error` | Error state detected | Check slurmd log, fix, clear |
| `future` | Not yet configured | Expected for planned nodes |
| `idle~` | Powered down (cloud/power-save) | Will power on when needed |
| `idle#` | Powering up | Wait for boot to complete |
| `alloc#` | Allocated but still powering up | Wait for boot |
| `down~` | Down and powered down | Investigate |
| `reboot` | Rebooting | Wait |

## State Suffixes

| Suffix | Meaning |
|--------|---------|
| `*` | Not responding (slurmd not communicating) |
| `~` | Powered down |
| `#` | Powering up |
| `$` | Maintenance reservation |
| `@` | Pending reboot |
| `^` | Reboot issued |

## Managing Node States

```bash
# Drain for maintenance
scontrol update NodeName=node001 State=DRAIN Reason="disk replacement"

# Return to service
scontrol update NodeName=node001 State=RESUME

# Force idle (clears error/down)
scontrol update NodeName=node001 State=IDLE

# Undrain without changing base state
scontrol update NodeName=node001 State=UNDRAIN

# Mark down (terminates running jobs on the node)
scontrol update NodeName=node001 State=DOWN Reason="hardware failure"

# Check reason for current state
scontrol show node node001 | grep -E "State|Reason"
```

## Checking Node States

```bash
# All non-idle nodes, with state and reason
sinfo -t alloc,mix,drain,down,error -N -o "%N %T %E"

# Nodes with reasons
sinfo -R

# Specific partition
sinfo -p gpu -N -l
```

## References

- [SchedMD: scontrol -- Node State](https://slurm.schedmd.com/scontrol.html)
- [SchedMD: sinfo man page](https://slurm.schedmd.com/sinfo.html)
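
## Decoding a State Suffix

The suffix table above can be applied mechanically when scripting around `sinfo` output. The following is a minimal sketch, not part of Slurm: `decode_node_state` is a hypothetical helper name, it handles only a single trailing suffix character, and it covers only a subset of the suffixes (`*`, `~`, `#`, `$`, `@`).

```bash
#!/usr/bin/env bash
# Split a node state string such as "idle~" or "down*" into its base
# state and the meaning of one trailing suffix character.
# decode_node_state is a hypothetical helper, not a Slurm command.
decode_node_state() {
    local state="$1"
    local base="$state"
    local meaning="no recognized suffix"

    # The suffix characters are quoted so they match literally.
    case "$state" in
        *'*') meaning="not responding" ;;
        *'~') meaning="powered down" ;;
        *'#') meaning="powering up" ;;
        *'$') meaning="maintenance reservation" ;;
        *'@') meaning="pending reboot" ;;
    esac

    # Strip the suffix only when one was recognized.
    if [[ "$meaning" != "no recognized suffix" ]]; then
        base="${state%?}"
    fi

    printf '%s: %s\n' "$base" "$meaning"
}
```

Quote the argument when calling it, since characters such as `*` and `~` are special to the shell: `decode_node_state 'idle~'` prints `idle: powered down`.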