
Troubleshooting

Exercises

  1. Diagnose why a job is pending

Submit a job to the gpu partition requesting 8 GPUs on a single node (more than any one node has, assuming 4 GPUs per node) and observe the pending reason. Use squeue to identify the reason code, then work out how to fix the request.

Hint / Solution
# Submit a job requesting more GPUs than a single node has
sbatch -p gpu --gres=gpu:8 --nodes=1 --wrap="sleep 120"

# Check the pending reason
squeue -u $USER -o "%.8i %.9P %.20j %.2t %.30R"
# Expected reason: (Resources) or (ReqNodeNotAvail); depending on the
# cluster's configuration, sbatch may instead reject the request at
# submission with "Requested node configuration is not available"

# Get more details
scontrol show job <jobid> | grep Reason

# Fix: either request fewer GPUs per node, or span multiple nodes:
#   --gres=gpu:4 --nodes=2 --ntasks-per-node=1
scancel <jobid>
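Before resubmitting, it can help to check the largest per-node GPU count the partition actually offers. A minimal sketch parsing the node/GRES columns with awk; the printf lines stand in for real `sinfo -h -p gpu -o "%N %G"` output, and the node names and counts here are hypothetical:

```shell
# Find the largest per-node GPU count in the partition.
# Sample lines are hypothetical; in real use, replace the printf with:
#   sinfo -h -p gpu -o "%N %G"
max=$(printf '%s\n' \
  'gpu[001-004] gpu:4' \
  'gpu[005-006] gpu:2' |
awk '/gpu/ { n = $NF; sub(/.*:/, "", n); if (n + 0 > m) m = n + 0 }
     END { print m + 0 }')
echo "max GPUs on a single node: $max"
# If the request exceeds $max, span nodes instead, e.g.
#   sbatch -p gpu --gres=gpu:4 --nodes=2
```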
  2. Drain a node and verify it stops accepting jobs

Drain node cpu010 with a reason string. Confirm the node shows as draining/drained in sinfo. Submit a job targeting that specific node and verify it won't run there.

Hint / Solution
# Drain the node
scontrol update NodeName=cpu010 State=DRAIN Reason="exercise: testing drain"

# Verify the state
sinfo -n cpu010
# State should show drain, drng (draining), or drained

# Check the reason
scontrol show node cpu010 | grep -i "state\|reason"

# Try to submit a job to that node -- it should pend
sbatch --nodelist=cpu010 --wrap="hostname"
squeue -u $USER
# Job will pend with reason (ReqNodeNotAvail) until the node is resumed

# Clean up
scancel -u $USER
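The draining states can show up in several spellings (drain, drng, drained, plus a `*` suffix when slurmd is unreachable), so a scripted check needs to match all of them. A small sketch; `is_draining` is a hypothetical helper you would feed the output of `sinfo -h -n cpu010 -o "%t"`:

```shell
# Hypothetical helper: does a sinfo state string mean the node is
# draining or drained? Feed it the output of:
#   sinfo -h -n cpu010 -o "%t"
is_draining() {
  case "$1" in
    drain*|drng*) return 0 ;;  # matches drain, drain*, drained, drng, drng*
    *)            return 1 ;;
  esac
}
is_draining "drng" && echo "drng: draining"
is_draining "idle" || echo "idle: not draining"
```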
  3. Bring a drained node back online

After the previous exercise, resume node cpu010 and verify it returns to an idle state and can accept new jobs.

Hint / Solution
# Resume the node
scontrol update NodeName=cpu010 State=RESUME

# Verify it's back to idle
sinfo -n cpu010
# State should show idle (or idle* if no slurmd running in a lab environment)

scontrol show node cpu010 | grep -i state
# State=IDLE

# Test that it accepts jobs
srun --nodelist=cpu010 hostname
# Should return: cpu010
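Recovery is not always instant, so a bounded poll loop avoids hanging while you wait for the node to come back. A sketch; the "simulated recovery" line stands in for re-running the real check, `state=$(sinfo -h -n cpu010 -o "%t")` followed by a sleep:

```shell
# Poll until the node reports idle, with a bounded number of attempts.
# The simulated-recovery line stands in for the real check:
#   state=$(sinfo -h -n cpu010 -o "%t"); sleep 5
attempt=0
state=down
while [ "$state" != "idle" ] && [ "$attempt" -lt 10 ]; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge 3 ]; then state=idle; fi  # simulated recovery
done
echo "state=$state after $attempt checks"
```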
  4. Find a job that was killed by OOM

Search completed jobs from the past 7 days for any that ended in the OUT_OF_MEMORY state. For each, determine how much memory was requested vs. how much was actually used at peak.

Hint / Solution
# Find OOM-killed jobs
sacct --starttime=now-7days --state=OUT_OF_MEMORY \
    --format=JobID%-12,User%-10,JobName%-20,ReqMem,MaxRSS,Elapsed,ExitCode,NodeList

# For a specific job, get the full picture
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocCPUS

# ExitCode 0:9 means the job was terminated by signal 9 (SIGKILL),
# which is how the OOM killer ends processes
# MaxRSS at or near ReqMem strongly suggests memory exhaustion
# (MaxRSS is sampled, so the true peak can be slightly higher)

# Also check for jobs killed by signal 9 that might not be tagged as OOM
sacct --starttime=now-7days --format=JobID,User,State,ExitCode | grep "0:9"
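The ReqMem-vs-MaxRSS comparison above can be automated. A sketch that flags jobs whose peak RSS reached at least 90% of the request; the pipe-delimited sample lines mimic `sacct -n -P --format=JobID,ReqMem,MaxRSS` output with hypothetical job IDs and sizes, and unit handling is simplified to plain G/M suffixes:

```shell
# Flag jobs whose MaxRSS reached >= 90% of ReqMem.
# Sample lines are hypothetical; in real use, replace the printf with:
#   sacct -n -P --starttime=now-7days --format=JobID,ReqMem,MaxRSS
flagged=$(printf '%s\n' \
  '1234.batch|4G|3900M' \
  '1235.batch|8G|1024M' |
awk -F'|' '
  function mb(v) { return (v ~ /G/) ? v * 1024 : v + 0 }  # G/M suffixes only
  {
    req = mb($2); used = mb($3)
    if (req > 0 && used >= 0.9 * req)
      printf "%s: used %dMB of %dMB requested\n", $1, used, req
  }')
echo "$flagged"
```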
  5. Investigate a node in DOWN state

Using scontrol, examine a node that shows as down or down*. Determine the reason it went down, check whether slurmd is running on it, and bring it back if possible.

Hint / Solution
# Find down nodes
sinfo -N -l | grep down

# Get details on a specific down node
scontrol show node cpu005 | grep -i "state\|reason\|lastbusy"

# Check if slurmd is running on the node
ssh cpu005 systemctl status slurmd

# If slurmd is stopped, restart it
ssh cpu005 systemctl restart slurmd

# If slurmd is running but the node is still down*, clear the state
scontrol update NodeName=cpu005 State=RESUME

# Verify recovery
sinfo -n cpu005
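When several nodes are unavailable at once, a one-line summary per node saves stepping through scontrol output node by node. A sketch; the sample lines mimic `sinfo -R -h -o "%n %E"` (node name, then reason), and both the node names and reasons here are made up:

```shell
# Summarise unavailable nodes and their reasons, one line each.
# Sample lines are hypothetical; in real use, replace the printf with:
#   sinfo -R -h -o "%n %E"
summary=$(printf '%s\n' \
  'cpu005 Not responding' \
  'cpu010 exercise: testing drain' |
while read -r node reason; do
  echo "node $node is unavailable: $reason"
done)
echo "$summary"
```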
