
Troubleshooting

Exercises

  1. Diagnose why a job is pending

Submit a job to the gpu partition requesting 8 GPUs on a single node (more than any one node has, assuming 4 GPUs per node) and observe the pending reason. Use squeue to identify the reason code, then work out how to fix the request.

Hint / Solution
# Submit a job requesting more GPUs than a single node has
sbatch -p gpu --gres=gpu:8 --nodes=1 --wrap="sleep 120"

# Check the pending reason
squeue -u $USER -o "%.8i %.9P %.20j %.2t %.30R"
# Expected reason: (Resources) or (ReqNodeNotAvail); depending on the
# cluster's configuration, sbatch may instead reject the request at
# submission with "Requested node configuration is not available"

# Get more details
scontrol show job <jobid> | grep Reason

# Fix: either request fewer GPUs per node, or span multiple nodes:
#   --gres=gpu:4 --nodes=2 --ntasks-per-node=1
scancel <jobid>
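Before resubmitting, it can help to check the largest per-node GPU count the partition actually offers. A minimal sketch parsing the node/GRES columns with awk; the printf lines stand in for real `sinfo -h -p gpu -o "%N %G"` output, and the node names and counts here are hypothetical:

```shell
# Find the largest per-node GPU count in the partition.
# Sample lines are hypothetical; in real use, replace the printf with:
#   sinfo -h -p gpu -o "%N %G"
max=$(printf '%s\n' \
  'gpu[001-004] gpu:4' \
  'gpu[005-006] gpu:2' |
awk '/gpu/ { n = $NF; sub(/.*:/, "", n); if (n + 0 > m) m = n + 0 }
     END { print m + 0 }')
echo "max GPUs on a single node: $max"
# If the request exceeds $max, span nodes instead, e.g.
#   sbatch -p gpu --gres=gpu:4 --nodes=2
```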
  2. Drain a node and verify it stops accepting jobs

Drain node cpu010 with a reason string. Confirm the node shows as draining/drained in sinfo. Submit a job targeting that specific node and verify it won't run there.

Hint / Solution
# Drain the node
scontrol update NodeName=cpu010 State=DRAIN Reason="exercise: testing drain"

# Verify the state
sinfo -n cpu010
# State should show drain, drng (draining), or drained

# Check the reason
scontrol show node cpu010 | grep -i "state\|reason"

# Try to submit a job to that node -- it should pend
sbatch --nodelist=cpu010 --wrap="hostname"
squeue -u $USER
# Job will pend with reason (ReqNodeNotAvail) until the node is resumed

# Clean up
scancel -u $USER
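The draining states can show up in several spellings (drain, drng, drained, plus a `*` suffix when slurmd is unreachable), so a scripted check needs to match all of them. A small sketch; `is_draining` is a hypothetical helper you would feed the output of `sinfo -h -n cpu010 -o "%t"`:

```shell
# Hypothetical helper: does a sinfo state string mean the node is
# draining or drained? Feed it the output of:
#   sinfo -h -n cpu010 -o "%t"
is_draining() {
  case "$1" in
    drain*|drng*) return 0 ;;  # matches drain, drain*, drained, drng, drng*
    *)            return 1 ;;
  esac
}
is_draining "drng" && echo "drng: draining"
is_draining "idle" || echo "idle: not draining"
```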
  3. Bring a drained node back online

After the previous exercise, resume node cpu010 and verify it returns to an idle state and can accept new jobs.

Hint / Solution
# Resume the node
scontrol update NodeName=cpu010 State=RESUME

# Verify it's back to idle
sinfo -n cpu010
# State should show idle (or idle* if no slurmd running in a lab environment)

scontrol show node cpu010 | grep -i state
# State=IDLE

# Test that it accepts jobs
srun --nodelist=cpu010 hostname
# Should return: cpu010
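Recovery is not always instant, so a bounded poll loop avoids hanging while you wait for the node to come back. A sketch; the "simulated recovery" line stands in for re-running the real check, `state=$(sinfo -h -n cpu010 -o "%t")` followed by a sleep:

```shell
# Poll until the node reports idle, with a bounded number of attempts.
# The simulated-recovery line stands in for the real check:
#   state=$(sinfo -h -n cpu010 -o "%t"); sleep 5
attempt=0
state=down
while [ "$state" != "idle" ] && [ "$attempt" -lt 10 ]; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge 3 ]; then state=idle; fi  # simulated recovery
done
echo "state=$state after $attempt checks"
```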
  4. Find a job that was killed by OOM

Search completed jobs from the past 7 days for any that ended in the OUT_OF_MEMORY state. For each, determine how much memory was requested vs. how much was actually used at peak.

Hint / Solution
# Find OOM-killed jobs
sacct --starttime=now-7days --state=OUT_OF_MEMORY \
    --format=JobID%-12,User%-10,JobName%-20,ReqMem,MaxRSS,Elapsed,ExitCode,NodeList

# For a specific job, get the full picture
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocCPUS

# ExitCode 0:9 means the job was terminated by signal 9 (SIGKILL),
# which is how the OOM killer ends processes
# MaxRSS at or near ReqMem strongly suggests memory exhaustion
# (MaxRSS is sampled, so the true peak can be slightly higher)

# Also check for jobs killed by signal 9 that might not be tagged as OOM
sacct --starttime=now-7days --format=JobID,User,State,ExitCode | grep "0:9"
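The ReqMem-vs-MaxRSS comparison above can be automated. A sketch that flags jobs whose peak RSS reached at least 90% of the request; the pipe-delimited sample lines mimic `sacct -n -P --format=JobID,ReqMem,MaxRSS` output with hypothetical job IDs and sizes, and unit handling is simplified to plain G/M suffixes:

```shell
# Flag jobs whose MaxRSS reached >= 90% of ReqMem.
# Sample lines are hypothetical; in real use, replace the printf with:
#   sacct -n -P --starttime=now-7days --format=JobID,ReqMem,MaxRSS
flagged=$(printf '%s\n' \
  '1234.batch|4G|3900M' \
  '1235.batch|8G|1024M' |
awk -F'|' '
  function mb(v) { return (v ~ /G/) ? v * 1024 : v + 0 }  # G/M suffixes only
  {
    req = mb($2); used = mb($3)
    if (req > 0 && used >= 0.9 * req)
      printf "%s: used %dMB of %dMB requested\n", $1, used, req
  }')
echo "$flagged"
```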
  5. Investigate a node in DOWN state

Using scontrol, examine a node that shows as down or down*. Determine the reason it went down, check whether slurmd is running on it, and bring it back if possible.

Hint / Solution
# Find down nodes
sinfo -N -l | grep down

# Get details on a specific down node
scontrol show node cpu005 | grep -i "state\|reason\|lastbusy"

# Check if slurmd is running on the node
ssh cpu005 systemctl status slurmd

# If slurmd is stopped, restart it
ssh cpu005 systemctl restart slurmd

# If slurmd is running but the node is still down*, clear the state
scontrol update NodeName=cpu005 State=RESUME

# Verify recovery
sinfo -n cpu005
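When several nodes are unavailable at once, a one-line summary per node saves stepping through scontrol output node by node. A sketch; the sample lines mimic `sinfo -R -h -o "%n %E"` (node name, then reason), and both the node names and reasons here are made up:

```shell
# Summarise unavailable nodes and their reasons, one line each.
# Sample lines are hypothetical; in real use, replace the printf with:
#   sinfo -R -h -o "%n %E"
summary=$(printf '%s\n' \
  'cpu005 Not responding' \
  'cpu010 exercise: testing drain' |
while read -r node reason; do
  echo "node $node is unavailable: $reason"
done)
echo "$summary"
```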
