Monitoring & Accounting
Exercises
- Generate a cluster utilization report
Use sreport to produce an overall cluster utilization report for the current month. Identify the percentage of CPU time that was allocated vs. idle.
Hint / Solution
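One possible approach, sketched on the assumption that accounting through slurmdbd is enabled and usage has already been rolled up: `sreport cluster utilization` with `-t percent` prints each time category as a share of the reported total. The helper names `month_start` and `cluster_utilization` are illustrative and not part of Slurm.

```shell
# First day of the current month in the YYYY-MM-DD form that sreport accepts.
month_start() {
    date +%Y-%m-01
}

# Cluster-wide utilization from the start of the month until now.
# -t percent shows each column as a percentage of reported time
# instead of raw CPU-minutes.
cluster_utilization() {
    sreport -t percent cluster utilization start="$(month_start)" end=now
}
```

Run `cluster_utilization` on a host that can reach slurmdbd, then compare the Allocated and Idle columns of the report.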
- Find the top 5 users by CPU hours
Generate a report showing the five heaviest CPU consumers this month. Include both their CPU hours and the account they charged to.
Hint / Solution
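A minimal sketch, assuming slurmdbd accounting is available: the `sreport user topusage` report lists the heaviest users, and by default each row is a user/account pair, so the charged account appears next to the usage. The function name `top_cpu_users` is hypothetical.

```shell
# Top five users by CPU time since the first of the month.
# -t hours reports usage in CPU-hours rather than CPU-minutes.
# TopCount limits the number of rows (sreport shows 10 by default).
top_cpu_users() {
    sreport -t hours user topusage start="$(date +%Y-%m-01)" end=now TopCount=5
}
```

Calling `top_cpu_users` prints the login, account, and usage for each of the five rows.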
- Check scheduler health with sdiag
Run sdiag and assess the scheduler's performance. Determine the mean scheduling cycle time, how many jobs have been backfilled, and whether the agent queue is healthy.
Hint / Solution
sdiag
# Key things to check:
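# 1. "Mean cycle" under "Main schedule statistics" -- should be well under 1,000,000 us (1 second)
# 2. "Total backfilled jobs" -- a nonzero count that keeps growing indicates backfill is working
# 3. "Agent queue size" -- should be 0 or near 0; persistently high values mean communication delays
# 4. "Last cycle when" under "Backfilling stats" -- this timestamp should be very recent
#    ("Last cycle" on its own is a duration in microseconds, not a timestamp)
# Reset stats to track from a known point (requires operator or administrator privileges):
sdiag --reset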
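The checks above can be condensed into a one-line summary. The sketch below is illustrative only: `sdiag_summary` is a hypothetical helper, and the field labels it matches are assumed to be worded as in recent sdiag output, so adjust the patterns if your version differs.

```shell
# Read sdiag output on stdin and print a short health summary.
# "Mean cycle" appears in both the main and the backfill sections,
# so the current section is tracked to pick the main scheduler value.
sdiag_summary() {
    awk '
        /^Main schedule statistics/ { section = "main" }
        /^Backfilling stats/        { section = "backfill" }
        /Agent queue size:/         { agent = $NF }
        section == "main" && /Mean cycle:/ { mean_us = $NF }
        index($0, "Total backfilled jobs (since last slurm start)") { backfilled = $NF }
        END {
            # Convert the mean cycle from microseconds to milliseconds.
            printf "agent_queue=%s mean_cycle_ms=%.1f backfilled=%s\n",
                   agent, mean_us / 1000, backfilled
        }
    '
}
```

Typical use would be `sdiag | sdiag_summary`, for example from a periodic monitoring job.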
- Find jobs that exceeded their memory request
Use sacct to find completed jobs from the past 7 days where the actual peak memory usage (MaxRSS) was close to or exceeded the requested memory. Identify jobs that were killed by OOM.
Hint / Solution
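# Find jobs killed by OOM directly
# (--allusers is needed to see other users' jobs when not running as root)
sacct --allusers --starttime=now-7days --state=OUT_OF_MEMORY \
  --format=JobID,User,JobName,ReqMem,MaxRSS,Elapsed,State,ExitCode
# Find completed jobs where MaxRSS was high relative to request
# (look for jobs where MaxRSS approaches ReqMem).
# MaxRSS is reported on step lines (e.g. .batch), not on the job line, so some
# columns are empty; --parsable2 gives "|"-delimited fields so that sort can
# reliably key on MaxRSS (field 5).
sacct --allusers --starttime=now-7days --state=COMPLETED --parsable2 --noheader \
  --format=JobID,User,JobName,ReqMem,MaxRSS,Elapsed \
  | sort -t'|' -k5,5h | tail -20
# Find jobs killed by signal 9 (SIGKILL -- often OOM even if not tagged).
# ExitCode is the last column, in the form exit_status:signal; matching the
# whole field avoids false hits such as "10:9".
sacct --allusers --starttime=now-7days \
  --format=JobID,User,JobName,State,ExitCode | awk '$NF == "0:9"'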
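Comparing MaxRSS with ReqMem by eye gets tedious, so the comparison can be scripted. The following is a rough sketch rather than a definitive tool: `mem_pressure` is a hypothetical helper, it assumes ReqMem is a per-job total such as `4G` or `4000M` (older releases that append `c` or `n` suffixes for per-CPU or per-node requests are not handled), and it treats values without a unit suffix as megabytes.

```shell
# Read "JobID|ReqMem|MaxRSS" records (sacct --parsable2 --noheader) on stdin
# and print jobs whose peak step MaxRSS reached at least the given fraction
# (first argument, default 0.9) of the requested memory.
mem_pressure() {
    awk -F'|' -v threshold="${1:-0.9}" '
        # Convert a size such as 512K, 4000M or 4G to megabytes.
        function to_mb(s,    n, u) {
            n = s + 0
            u = substr(s, length(s), 1)
            if (u == "K") return n / 1024
            if (u == "G") return n * 1024
            if (u == "T") return n * 1024 * 1024
            return n    # M, or no suffix (assumed to be megabytes)
        }
        {
            split($1, parts, ".")
            id = parts[1]
            if ($1 == id) {            # job line: carries ReqMem
                req[id] = to_mb($2)
                order[++count] = id
            } else if ($3 != "") {     # step line: carries MaxRSS
                rss = to_mb($3)
                if (rss > peak[id]) peak[id] = rss
            }
        }
        END {
            for (i = 1; i <= count; i++) {
                id = order[i]
                if (req[id] > 0 && peak[id] / req[id] >= threshold)
                    printf "%s %.0fM of %.0fM (%.0f%%)\n",
                           id, peak[id], req[id], 100 * peak[id] / req[id]
            }
        }
    '
}
```

It could be fed with `sacct --allusers --starttime=now-7days --state=COMPLETED --parsable2 --noheader --format=JobID,ReqMem,MaxRSS | mem_pressure 0.9`. Taking the maximum over all steps of a job is a design choice here, since the batch step is not always the one with the largest footprint.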