Monitoring & Accounting

Exercises

  1. Generate a cluster utilization report

Use sreport to produce an overall cluster utilization report for the current month. Identify the percentage of CPU time that was allocated vs. idle.

Hint / Solution
# Overall cluster utilization
sreport cluster utilization start=2026-04-01

# Output shows columns: Allocated, Down, PLND Down (planned), Idle, Reserved, Reported
# Allocated / Reported = utilization percentage

# Show each column as a percentage of reported time instead of raw minutes
sreport cluster utilization start=2026-04-01 -t percent
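The Allocated / Reported ratio can also be computed directly from sreport's parseable output. A minimal sketch, assuming the default column order of `sreport -P cluster utilization` (Cluster|Allocated|Down|PLND Down|Idle|Reserved|Reported; newer Slurm releases may add or rename columns, so verify against your version first):

```shell
# -P = parsable2 (pipe-delimited, no trailing pipe), -n = no header,
# -t seconds = raw seconds so the division is unit-free.
# Assumed field order: $2 = Allocated, $7 = Reported.
sreport -P -n -t seconds cluster utilization start=2026-04-01 \
  | awk -F'|' '{ printf "Utilization: %.1f%%\n", 100 * $2 / $7 }'
```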
  2. Find the top 5 users by CPU hours

Generate a report showing the five heaviest CPU consumers this month. Include both their CPU hours and the account they charged to.

Hint / Solution
sreport user TopUsage start=2026-04-01 --tres=cpu TopCount=5

# For GPU hours instead:
sreport user TopUsage start=2026-04-01 --tres=gres/gpu TopCount=5

# For a specific account's users only:
sreport user TopUsage start=2026-04-01 --tres=cpu TopCount=5 Accounts=smith_lab
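The TopUsage report can be cross-checked against raw accounting data by summing per-user CPU time with sacct. A sketch, assuming the `CPUTimeRAW` field (CPU time in seconds) is populated on your cluster:

```shell
# -a = all users, -X = allocations only (skip steps, avoid double counting),
# -P = parsable2, -n = no header. Sum CPUTimeRAW per user, convert to hours,
# and print the five heaviest consumers.
sacct -a -X -P -n --starttime=2026-04-01 --format=User,CPUTimeRAW \
  | awk -F'|' '$1 != "" { h[$1] += $2 / 3600 }
               END { for (u in h) printf "%s %.1f\n", u, h[u] }' \
  | sort -k2 -nr | head -5
```

Small discrepancies against sreport are normal: sreport rolls up usage on a fixed schedule, while sacct reads the job records directly.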
  3. Check scheduler health with sdiag

Run sdiag and assess the scheduler's performance. Determine the mean scheduling cycle time, how many jobs have been backfilled, and whether the agent queue is healthy.

Hint / Solution
sdiag

# Key things to check:
# 1. "Mean cycle" under "Main schedule statistics" -- should be well under 1,000,000 us (1 second)
# 2. "Total backfilled jobs" -- a healthy number indicates backfill is working
# 3. "Agent queue size" -- should be 0 or near 0; high values mean communication delays
# 4. "Last cycle" timestamp -- should be very recent

# Reset stats to track from a known point:
sdiag --reset
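The "under 1,000,000 us" check can be automated with a small filter. A sketch, assuming the label text "Mean cycle" as it appears in typical sdiag output (note it occurs under both the main and backfilling statistics, so this prints one line per section):

```shell
# Flag a slow scheduler: warn if any "Mean cycle" value exceeds 1 second.
# Assumes whitespace-separated output where the value is the third field.
sdiag | awk '/Mean cycle/ {
  if ($3 + 0 > 1000000) print "WARN: mean cycle " $3 " us"
  else                  print "OK: mean cycle " $3 " us"
}'
```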
  4. Find jobs that exceeded their memory request

Use sacct to find completed jobs from the past 7 days where the actual peak memory usage (MaxRSS) was close to or exceeded the requested memory. Identify jobs that were killed by OOM.

Hint / Solution
# Find jobs killed by OOM directly (-a = all users, not just your own)
sacct -a --starttime=now-7days --state=OUT_OF_MEMORY \
    --format=JobID,User,JobName,ReqMem,MaxRSS,Elapsed,State,ExitCode

# Find completed jobs where MaxRSS was high relative to request
# (look for jobs where MaxRSS approaches ReqMem)
sacct -a --starttime=now-7days --state=COMPLETED \
    --format=JobID%-12,User%-10,JobName%-20,ReqMem,MaxRSS,Elapsed \
    | sort -k5 -h | tail -20

# Find jobs killed by signal 9 (SIGKILL -- often OOM even if not tagged);
# ExitCode is reported as "exitcode:signal", so "0:9" matches SIGKILL
sacct -a --starttime=now-7days \
    --format=JobID,User,JobName,State,ExitCode | grep "0:9"
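The "MaxRSS approaches ReqMem" comparison can be done programmatically instead of by eye. A sketch only: the unit parsing below is crude (it handles bare K/M/G/T suffixes, but ReqMem on older Slurm releases may carry a per-node/per-CPU suffix such as "4Gn" or "4000Mc", which this treats the same as "4G"):

```shell
# Flag entries whose MaxRSS exceeds 90% of ReqMem.
sacct -a -P -n --starttime=now-7days --state=COMPLETED \
    --format=JobID,ReqMem,MaxRSS \
  | awk -F'|' '
      function bytes(s,  n) {            # crude K/M/G/T suffix parser
        n = s + 0
        if (s ~ /K/) return n * 1024
        if (s ~ /M/) return n * 1024 ^ 2
        if (s ~ /G/) return n * 1024 ^ 3
        if (s ~ /T/) return n * 1024 ^ 4
        return n
      }
      $2 != "" && $3 != "" {
        if (bytes($3) > 0.9 * bytes($2)) print $1, $2, $3
      }'
```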
