# Capacity Planning

## Overview

Capacity planning for an HPC cluster means answering three questions: Are we using what we have? Do we need more? Where should we invest next? Slurm provides the data to answer all three through its accounting and reporting tools.

This module covers utilization metrics, reporting tools, and planning strategies for IT leaders managing shared HPC resources.

## Key Utilization Metrics

### Cluster Utilization

The most basic metric: what fraction of available resources are actively running jobs?

```bash
# Cluster utilization over the past month
sreport cluster utilization Start=2026-03-01 End=2026-04-01
# Cluster   Allocated  Down  PLND Down   Idle  Reserved   Reported
# bioclust+     72.4%  2.1%       0.5%  24.5%      0.5%  744:00:00
```

| Metric | Meaning | Healthy Range |
|--------|---------|---------------|
| **Allocated** | Time CPUs/GPUs were running jobs | 60-85% |
| **Idle** | Available but unused | 15-30% (some headroom is healthy) |
| **Down** | Hardware failures, maintenance | < 5% |
| **Reserved** | Held for reservations | Context-dependent |

**Interpretation:** Sustained utilization above 85% with growing queue wait times indicates you need more capacity. Sustained utilization below 50% suggests over-provisioning or a user adoption problem.

### Queue Wait Times

How long are jobs waiting before they start?

```bash
# Average wait time by partition over the past month
sacct -S 2026-03-01 -E 2026-04-01 --partition=gpu \
  --format=Partition,Elapsed,Submit,Start \
  --noheader | awk '... compute wait time ...'
```

Queue wait time is the metric researchers feel most directly. If GPU jobs routinely wait hours to start, it impacts scientific productivity.

### Per-Account Utilization

Who is using the cluster, and how much?
```bash
# Top accounts by CPU hours (past month)
sreport account TopUsage Start=2026-03-01 End=2026-04-01 TopCount=10
#    Account      Login     Proper Name       Used
# ---------- ---------- --------------- ----------
#     cryoem                                245000
#   genomics                                180000
#   drugdisc                                120000
#    ml_team                                 85000
```

```bash
# Top users
sreport user TopUsage Start=2026-03-01 End=2026-04-01 TopCount=20
```

This data feeds chargeback/showback reports and helps identify which groups would benefit most from additional capacity.

### GPU Utilization

For GPU-heavy life science clusters (cryo-EM, ML, MD), GPU utilization is often the most important metric:

```bash
# GPU hours by account
sreport cluster AccountUtilizationByUser Start=2026-03-01 \
  End=2026-04-01 -t Hours -T gres/gpu
```

## Reporting Tools

### sreport

Slurm's built-in reporting tool generates management-level summaries from the accounting database:

| Report | Command | Shows |
|--------|---------|-------|
| Cluster utilization | `sreport cluster utilization` | Overall cluster busy/idle/down |
| Top accounts | `sreport account TopUsage` | Largest consuming accounts |
| Top users | `sreport user TopUsage` | Largest consuming users |
| Account utilization | `sreport cluster AccountUtilizationByUser` | Breakdown by account and user |
| Job sizes | `sreport job SizesByAccount` | Job size distribution |

All sreport commands accept `Start=`, `End=`, and format options for custom reporting periods.
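These reports are also easy to post-process for showback. As a minimal sketch, the snippet below aggregates per-account shares of total usage from sreport's parsable output (this assumes output produced with flags like `-P -n`, i.e. pipe-delimited and headerless, with an `Account|Login|Proper Name|Used` column layout; the `showback` helper name is ours, and you should verify the column order against your Slurm version):

```python
def showback(lines):
    """Aggregate pipe-delimited sreport rows into per-account (hours, percent).

    Expects rows shaped like 'account|login|proper name|used_hours';
    rows with a missing Used field are skipped.
    """
    usage = {}
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 4 or not fields[3]:
            continue
        account, used = fields[0], float(fields[3])
        usage[account] = usage.get(account, 0.0) + used
    total = sum(usage.values()) or 1.0  # avoid division by zero on empty input
    return {acct: (hours, 100.0 * hours / total) for acct, hours in usage.items()}


if __name__ == "__main__":
    # Sample rows mimicking the format above (hypothetical numbers)
    sample = ["cryoem|||245000", "genomics|||180000"]
    for acct, (hours, pct) in showback(sample).items():
        print(f"{acct:12s} {hours:>10.0f} h  {pct:5.1f}%")
```

In practice you would feed it the output of a pipeline such as `sreport account TopUsage ... -P -n`, captured via `subprocess` or a cron-driven export.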
### sacct

For detailed per-job analysis:

```bash
# Find jobs whose peak memory stayed below half of what they requested
# (over-request waste; note ReqMem/MaxRSS carry unit suffixes, so
# normalize units before comparing in production)
sacct -S 2026-03-01 --format=JobID,User,ReqMem,MaxRSS,Elapsed,ExitCode \
  --noheader | awk '$4 < $3 * 0.5 {print}'

# Find jobs with very low CPU efficiency
sacct -S 2026-03-01 --format=JobID,User,AllocCPUS,TotalCPU,Elapsed \
  --noheader
```

### sdiag

Scheduler-level diagnostics:

```bash
sdiag
# Server thread count: 10
# Jobs submitted: 45,231
# Jobs started: 44,892
# Jobs completed: 44,100
# Backfill cycle (microseconds): mean 125,432
# Backfill last depth: 2,500
```

Key indicators from sdiag:

- **Backfill cycle time:** if this exceeds several seconds, the scheduler may need tuning
- **Jobs submitted vs. completed:** a large gap indicates failures or cancellations

### External Monitoring

For dashboards and alerting, integrate Slurm with Prometheus + Grafana:

- **slurm-exporter:** Prometheus exporter for Slurm metrics (node states, job counts, queue depths)
- **Custom sacct queries:** periodic exports to a data warehouse for trend analysis
- **CloudWatch** (AWS): ParallelCluster and PCS emit metrics to CloudWatch

## Planning Strategies

### Right-Sizing Partitions

Analyze which partitions are oversubscribed vs.
underutilized:

```bash
# Utilization per partition
for part in cpu gpu highmem; do
  echo "=== $part ==="
  sreport cluster utilization Start=2026-03-01 Partitions=$part
done
```

Common patterns:

- **GPU partition always full:** add more GPU nodes or enable preemption to prioritize critical work
- **CPU partition idle while GPU is full:** users may be requesting GPUs when CPU would suffice (education opportunity)
- **High-memory partition underused:** consider reducing dedicated highmem nodes and using `--mem` on general compute instead

### When to Add Capacity

Indicators that you need more resources:

| Signal | Metric | Threshold |
|--------|--------|-----------|
| Long wait times | Average queue wait | > 2-4 hours for interactive, > 24 h for batch |
| High utilization | Allocated % | > 85% sustained |
| Backfill inefficiency | Backfill depth | Many short jobs unable to backfill |
| User complaints | Support tickets | Increasing trend |
| Fairshare decay | `sshare -a` | All shares near zero (everyone is over-allocated) |

### When to Optimize Instead of Adding

Sometimes adding capacity is not the answer:

| Problem | Optimization |
|---------|--------------|
| Jobs requesting 64 cores but using 4 | User education on right-sizing |
| Jobs requesting 500 GB RAM on highmem | Enforce memory limits via QOS |
| One user consuming 80% of GPU | Fairshare + GrpTRES limits |
| Short jobs waiting behind long jobs | Tune backfill parameters |
| Idle reservations | Review reservation policies |

### Cloud Bursting Economics

For organizations with both on-prem and cloud (ParallelCluster/PCS):

**When to burst to cloud:**

- On-prem utilization exceeds 85% for extended periods
- Time-sensitive work (grant deadlines, publication timelines)
- GPU workloads that exceed on-prem GPU count
- Temporary capacity for large campaigns (virtual screening, cryo-EM data collection)

**When to invest in on-prem:**

- Sustained baseline utilization justifies the capital expense
- Data gravity (massive datasets
that are expensive to move)
- Regulatory/compliance requirements for data locality
- GPU workloads running 24/7 (on-prem GPU is cheaper than cloud at >60% utilization)

### Capacity Planning Cycle

A recommended quarterly review:

1. **Collect metrics:** sreport for the past quarter (utilization, wait times, top accounts)
2. **Analyze trends:** Is utilization growing? Which groups are driving growth?
3. **Forecast demand:** talk to research leads about upcoming projects (new grants, campaigns)
4. **Model options:** more on-prem hardware vs. cloud bursting vs. scheduling optimization
5. **Present to stakeholders:** utilization report with recommendations and cost estimates

## Related Modules

- [Monitoring & Accounting](../admin/06-monitoring-accounting.md) -- sacct, sreport, sdiag details
- [Policies & Priority](../admin/07-policies-priority.md) -- fairshare and scheduling policies
- [Cost Allocation](cost-allocation.md) -- chargeback models
- [Why Slurm](why-slurm.md) -- strategic context

## References

- [SchedMD: sreport](https://slurm.schedmd.com/sreport.html)
- [SchedMD: sacct](https://slurm.schedmd.com/sacct.html)
- [SchedMD: sdiag](https://slurm.schedmd.com/sdiag.html)
- [SchedMD: Resource Limits](https://slurm.schedmd.com/resource_limits.html)
- [SchedMD: Priority/Multifactor](https://slurm.schedmd.com/priority_multifactor.html)