Monitoring & Accounting

Exercises

  1. Generate a cluster utilization report

Use sreport to produce an overall cluster utilization report for the current month. Identify the percentage of CPU time that was allocated vs. idle.

Hint / Solution
# Overall cluster utilization
sreport cluster utilization start=2026-04-01

# Output shows columns: Allocated, Down, PLND Down (planned), Idle, Reserved, Reported
# Allocated / Reported = utilization percentage

# Show each column as a percentage of reported time instead of raw minutes
sreport cluster utilization start=2026-04-01 -t percent
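The Allocated / Reported ratio can also be computed directly from sreport's parseable output. A minimal sketch, assuming the default column order of `sreport -P cluster utilization` (Cluster|Allocated|Down|PLND Down|Idle|Reserved|Reported; newer Slurm releases may add or rename columns, so verify against your version first):

```shell
# -P = parsable2 (pipe-delimited, no trailing pipe), -n = no header,
# -t seconds = raw seconds so the division is unit-free.
# Assumed field order: $2 = Allocated, $7 = Reported.
sreport -P -n -t seconds cluster utilization start=2026-04-01 \
  | awk -F'|' '{ printf "Utilization: %.1f%%\n", 100 * $2 / $7 }'
```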
  2. Find the top 5 users by CPU hours

Generate a report showing the five heaviest CPU consumers this month. Include both their CPU hours and the account they charged to.

Hint / Solution
sreport user TopUsage start=2026-04-01 --tres=cpu TopCount=5

# For GPU hours instead:
sreport user TopUsage start=2026-04-01 --tres=gres/gpu TopCount=5

# For a specific account's users only:
sreport user TopUsage start=2026-04-01 --tres=cpu TopCount=5 Accounts=smith_lab
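The TopUsage report can be cross-checked against raw accounting data by summing per-user CPU time with sacct. A sketch, assuming the `CPUTimeRAW` field (CPU time in seconds) is populated on your cluster:

```shell
# -a = all users, -X = allocations only (skip steps, avoid double counting),
# -P = parsable2, -n = no header. Sum CPUTimeRAW per user, convert to hours,
# and print the five heaviest consumers.
sacct -a -X -P -n --starttime=2026-04-01 --format=User,CPUTimeRAW \
  | awk -F'|' '$1 != "" { h[$1] += $2 / 3600 }
               END { for (u in h) printf "%s %.1f\n", u, h[u] }' \
  | sort -k2 -nr | head -5
```

Small discrepancies against sreport are normal: sreport rolls up usage on a fixed schedule, while sacct reads the job records directly.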
  3. Check scheduler health with sdiag

Run sdiag and assess the scheduler's performance. Determine the mean scheduling cycle time, how many jobs have been backfilled, and whether the agent queue is healthy.

Hint / Solution
sdiag

# Key things to check:
# 1. "Mean cycle" under "Main schedule statistics" -- should be well under 1,000,000 us (1 second)
# 2. "Total backfilled jobs" -- a healthy number indicates backfill is working
# 3. "Agent queue size" -- should be 0 or near 0; high values mean communication delays
# 4. "Last cycle" timestamp -- should be very recent

# Reset stats to track from a known point:
sdiag --reset
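The "under 1,000,000 us" check can be automated with a small filter. A sketch, assuming the label text "Mean cycle" as it appears in typical sdiag output (note it occurs under both the main and backfilling statistics, so this prints one line per section):

```shell
# Flag a slow scheduler: warn if any "Mean cycle" value exceeds 1 second.
# Assumes whitespace-separated output where the value is the third field.
sdiag | awk '/Mean cycle/ {
  if ($3 + 0 > 1000000) print "WARN: mean cycle " $3 " us"
  else                  print "OK: mean cycle " $3 " us"
}'
```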
  4. Find jobs that exceeded their memory request

Use sacct to find completed jobs from the past 7 days where the actual peak memory usage (MaxRSS) was close to or exceeded the requested memory. Identify jobs that were killed by OOM.

Hint / Solution
# Find jobs killed by OOM directly (-a = all users, not just your own)
sacct -a --starttime=now-7days --state=OUT_OF_MEMORY \
    --format=JobID,User,JobName,ReqMem,MaxRSS,Elapsed,State,ExitCode

# Find completed jobs where MaxRSS was high relative to request
# (look for jobs where MaxRSS approaches ReqMem)
sacct -a --starttime=now-7days --state=COMPLETED \
    --format=JobID%-12,User%-10,JobName%-20,ReqMem,MaxRSS,Elapsed \
    | sort -k5 -h | tail -20

# Find jobs killed by signal 9 (SIGKILL -- often OOM even if not tagged);
# ExitCode is reported as "exitcode:signal", so "0:9" matches SIGKILL
sacct -a --starttime=now-7days \
    --format=JobID,User,JobName,State,ExitCode | grep "0:9"
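The "MaxRSS approaches ReqMem" comparison can be done programmatically instead of by eye. A sketch only: the unit parsing below is crude (it handles bare K/M/G/T suffixes, but ReqMem on older Slurm releases may carry a per-node/per-CPU suffix such as "4Gn" or "4000Mc", which this treats the same as "4G"):

```shell
# Flag entries whose MaxRSS exceeds 90% of ReqMem.
sacct -a -P -n --starttime=now-7days --state=COMPLETED \
    --format=JobID,ReqMem,MaxRSS \
  | awk -F'|' '
      function bytes(s,  n) {            # crude K/M/G/T suffix parser
        n = s + 0
        if (s ~ /K/) return n * 1024
        if (s ~ /M/) return n * 1024 ^ 2
        if (s ~ /G/) return n * 1024 ^ 3
        if (s ~ /T/) return n * 1024 ^ 4
        return n
      }
      $2 != "" && $3 != "" {
        if (bytes($3) > 0.9 * bytes($2)) print $1, $2, $3
      }'
```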
