# Capacity Planning

## Overview

Capacity planning for an HPC cluster means answering three questions: Are we using what we have? Do we need more? Where should we invest next? Slurm provides the data to answer all three through its accounting and reporting tools.

This module covers utilization metrics, reporting tools, and planning strategies for IT leaders managing shared HPC resources.

## Key Utilization Metrics

### Cluster Utilization

The most basic metric: what fraction of available resources are actively running jobs?

```bash
# Cluster utilization over the past month
sreport cluster utilization Start=2026-03-01 End=2026-04-01
# Cluster   Allocated  Down  PLND Down   Idle  Reserved   Reported
# bioclust+     72.4%  2.1%       0.5%  24.5%      0.5%  744:00:00
```

| Metric | Meaning | Healthy Range |
|--------|---------|---------------|
| **Allocated** | Time CPUs/GPUs were running jobs | 60-85% |
| **Idle** | Available but unused | 15-30% (some headroom is healthy) |
| **Down** | Hardware failures, maintenance | < 5% |
| **Reserved** | Held for reservations | Context-dependent |

**Interpretation:** Sustained utilization above 85% with growing queue wait times indicates you need more capacity. Sustained utilization below 50% suggests over-provisioning or a user adoption problem.

### Queue Wait Times

How long are jobs waiting before they start?

```bash
# Average wait time by partition over the past month
sacct -S 2026-03-01 -E 2026-04-01 --partition=gpu \
  --format=Partition,Elapsed,Submit,Start \
  --noheader | awk '... compute wait time ...'
```

Queue wait time is the metric researchers feel most directly. If GPU jobs routinely wait hours to start, it impacts scientific productivity.

### Per-Account Utilization

Who is using the cluster, and how much?
```bash
# Top accounts by CPU hours (past month)
sreport account TopUsage Start=2026-03-01 End=2026-04-01 TopCount=10
#    Account      Login     Proper Name       Used
# ---------- ---------- --------------- ----------
#     cryoem                                245000
#   genomics                                180000
#   drugdisc                                120000
#    ml_team                                 85000
```

```bash
# Top users
sreport user TopUsage Start=2026-03-01 End=2026-04-01 TopCount=20
```

This data feeds chargeback/showback reports and helps identify which groups would benefit most from additional capacity.

### GPU Utilization

For GPU-heavy life science clusters (cryo-EM, ML, MD), GPU utilization is often the most important metric:

```bash
# GPU hours by account
sreport cluster AccountUtilizationByUser Start=2026-03-01 \
  End=2026-04-01 -t Hours -T gres/gpu
```

## Reporting Tools

### sreport

Slurm's built-in reporting tool generates management-level summaries from the accounting database:

| Report | Command | Shows |
|--------|---------|-------|
| Cluster utilization | `sreport cluster utilization` | Overall cluster busy/idle/down |
| Top accounts | `sreport account TopUsage` | Largest consuming accounts |
| Top users | `sreport user TopUsage` | Largest consuming users |
| Account utilization | `sreport cluster AccountUtilizationByUser` | Breakdown by account and user |
| Job sizes | `sreport job SizesByAccount` | Job size distribution |

All sreport commands accept `Start=`, `End=`, and format options for custom reporting periods.
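These reports are also easy to post-process for showback. As a minimal sketch, the snippet below aggregates per-account shares of total usage from sreport's parsable output (this assumes output produced with flags like `-P -n`, i.e. pipe-delimited and headerless, with an `Account|Login|Proper Name|Used` column layout; the `showback` helper name is ours, and you should verify the column order against your Slurm version):

```python
def showback(lines):
    """Aggregate pipe-delimited sreport rows into per-account (hours, percent).

    Expects rows shaped like 'account|login|proper name|used_hours';
    rows with a missing Used field are skipped.
    """
    usage = {}
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 4 or not fields[3]:
            continue
        account, used = fields[0], float(fields[3])
        usage[account] = usage.get(account, 0.0) + used
    total = sum(usage.values()) or 1.0  # avoid division by zero on empty input
    return {acct: (hours, 100.0 * hours / total) for acct, hours in usage.items()}


if __name__ == "__main__":
    # Sample rows mimicking the format above (hypothetical numbers)
    sample = ["cryoem|||245000", "genomics|||180000"]
    for acct, (hours, pct) in showback(sample).items():
        print(f"{acct:12s} {hours:>10.0f} h  {pct:5.1f}%")
```

In practice you would feed it the output of a pipeline such as `sreport account TopUsage ... -P -n`, captured via `subprocess` or a cron-driven export.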
### sacct

For detailed per-job analysis:

```bash
# Find jobs whose peak memory stayed below half of what they requested
# (over-request waste; note ReqMem/MaxRSS carry unit suffixes, so
# normalize units before comparing in production)
sacct -S 2026-03-01 --format=JobID,User,ReqMem,MaxRSS,Elapsed,ExitCode \
  --noheader | awk '$4 < $3 * 0.5 {print}'

# Find jobs with very low CPU efficiency
sacct -S 2026-03-01 --format=JobID,User,AllocCPUS,TotalCPU,Elapsed \
  --noheader
```

### sdiag

Scheduler-level diagnostics:

```bash
sdiag
# Server thread count: 10
# Jobs submitted: 45,231
# Jobs started: 44,892
# Jobs completed: 44,100
# Backfill cycle (microseconds): mean 125,432
# Backfill last depth: 2,500
```

Key indicators from sdiag:

- **Backfill cycle time:** if this exceeds several seconds, the scheduler may need tuning
- **Jobs submitted vs. completed:** a large gap indicates failures or cancellations

### External Monitoring

For dashboards and alerting, integrate Slurm with Prometheus + Grafana:

- **slurm-exporter:** Prometheus exporter for Slurm metrics (node states, job counts, queue depths)
- **Custom sacct queries:** periodic exports to a data warehouse for trend analysis
- **CloudWatch** (AWS): ParallelCluster and PCS emit metrics to CloudWatch

## Planning Strategies

### Right-Sizing Partitions

Analyze which partitions are oversubscribed vs.
underutilized:

```bash
# Utilization per partition
for part in cpu gpu highmem; do
  echo "=== $part ==="
  sreport cluster utilization Start=2026-03-01 Partitions=$part
done
```

Common patterns:

- **GPU partition always full:** add more GPU nodes or enable preemption to prioritize critical work
- **CPU partition idle while GPU is full:** users may be requesting GPUs when CPU would suffice (education opportunity)
- **High-memory partition underused:** consider reducing dedicated highmem nodes and using `--mem` on general compute instead

### When to Add Capacity

Indicators that you need more resources:

| Signal | Metric | Threshold |
|--------|--------|-----------|
| Long wait times | Average queue wait | > 2-4 hours for interactive, > 24 h for batch |
| High utilization | Allocated % | > 85% sustained |
| Backfill inefficiency | Backfill depth | Many short jobs unable to backfill |
| User complaints | Support tickets | Increasing trend |
| Fairshare decay | `sshare -a` | All shares near zero (everyone is over-allocated) |

### When to Optimize Instead of Adding

Sometimes adding capacity is not the answer:

| Problem | Optimization |
|---------|--------------|
| Jobs requesting 64 cores but using 4 | User education on right-sizing |
| Jobs requesting 500 GB RAM on highmem | Enforce memory limits via QOS |
| One user consuming 80% of GPU | Fairshare + GrpTRES limits |
| Short jobs waiting behind long jobs | Tune backfill parameters |
| Idle reservations | Review reservation policies |

### Cloud Bursting Economics

For organizations with both on-prem and cloud (ParallelCluster/PCS):

**When to burst to cloud:**

- On-prem utilization exceeds 85% for extended periods
- Time-sensitive work (grant deadlines, publication timelines)
- GPU workloads that exceed on-prem GPU count
- Temporary capacity for large campaigns (virtual screening, cryo-EM data collection)

**When to invest in on-prem:**

- Sustained baseline utilization justifies the capital expense
- Data gravity (massive datasets
that are expensive to move)
- Regulatory/compliance requirements for data locality
- GPU workloads running 24/7 (on-prem GPU is cheaper than cloud at >60% utilization)

### Capacity Planning Cycle

A recommended quarterly review:

1. **Collect metrics:** sreport for the past quarter (utilization, wait times, top accounts)
2. **Analyze trends:** Is utilization growing? Which groups are driving growth?
3. **Forecast demand:** talk to research leads about upcoming projects (new grants, campaigns)
4. **Model options:** more on-prem hardware vs. cloud bursting vs. scheduling optimization
5. **Present to stakeholders:** utilization report with recommendations and cost estimates

## Related Modules

- [Monitoring & Accounting](../admin/06-monitoring-accounting.md) -- sacct, sreport, sdiag details
- [Policies & Priority](../admin/07-policies-priority.md) -- fairshare and scheduling policies
- [Cost Allocation](cost-allocation.md) -- chargeback models
- [Why Slurm](why-slurm.md) -- strategic context

## References

- [SchedMD: sreport](https://slurm.schedmd.com/sreport.html)
- [SchedMD: sacct](https://slurm.schedmd.com/sacct.html)
- [SchedMD: sdiag](https://slurm.schedmd.com/sdiag.html)
- [SchedMD: Resource Limits](https://slurm.schedmd.com/resource_limits.html)
- [SchedMD: Priority/Multifactor](https://slurm.schedmd.com/priority_multifactor.html)