# Cost Allocation

## Overview

HPC resources cost money -- whether it is capital expense for on-prem hardware or pay-per-use cloud instances. Cost allocation answers the question: **who is consuming what, and how do we pay for it?**

Slurm's accounting system tracks every job's resource consumption (CPU hours, GPU hours, memory, wall time) and attributes it to accounts and users. This data feeds chargeback (billing groups for actual costs) and showback (reporting usage without billing) models.

## Prerequisites: Slurm Accounting

Cost allocation requires Slurm accounting (slurmdbd) to be enabled and configured with:

1. **Accounts** -- organizational units (departments, labs, grants)
2. **Users** -- mapped to accounts
3. **Associations** -- user-to-account mappings with optional limits

```bash
# Example account hierarchy for a life science organization
sacctmgr add account bioteam Description="BioTeam Organization"
sacctmgr add account structural parent=bioteam Description="Structural Biology"
sacctmgr add account drugdisc parent=bioteam Description="Drug Discovery"
sacctmgr add account genomics parent=bioteam Description="Genomics"

# Add users to accounts
sacctmgr add user jsmith Account=structural
sacctmgr add user kpatel Account=drugdisc DefaultAccount=drugdisc
```

Users can belong to multiple accounts and specify which account to charge at job submission:

```bash
sbatch --account=structural my_cryoem_job.sh
sbatch --account=drugdisc my_glide_job.sh
```

## Chargeback vs. Showback

| Model | Description | Billing | Use Case |
|-------|-------------|---------|----------|
| **Showback** | Report usage, no actual billing | Informational | Academic, internal teams |
| **Chargeback** | Bill groups for actual resource consumption | Real money | Core facilities, shared services |

Most life science HPC operations start with showback and evolve toward chargeback as the organization matures.
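A showback statement is just usage multiplied by an internal rate. The sketch below prices per-account CPU-hour totals (the kind of figures `sreport` produces) with plain `awk`; the $0.068/core-hour rate and the heredoc usage numbers are illustrative assumptions, not real accounting data:

```shell
#!/bin/sh
# Showback sketch: price per-account CPU-hour totals at a flat internal rate.
# The rate and the usage figures below are illustrative assumptions.
rate=0.068   # assumed fully loaded cost per core-hour, in dollars

awk -v rate="$rate" '
  { charge = $2 * rate; total += charge
    printf "%-12s %10d CPU-h  $%9.2f\n", $1, $2, charge }
  END { printf "%-12s %10s        $%9.2f\n", "TOTAL", "", total }
' <<'EOF'
structural 70800
drugdisc 96300
genomics 98400
EOF
```

Swapping the heredoc for real `sreport ... Format=Account,Used` output (parsed into `account hours` pairs) turns this into a minimal monthly showback report.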
## Usage Reporting with sreport

### CPU Hours by Account

```bash
# Total CPU hours per account, past quarter
sreport cluster AccountUtilizationByUser \
    Start=2026-01-01 End=2026-04-01 -t Hours \
    Format=Account,Login,Used

# Account       Login   Proper Name   Used(CPU Hours)
# ----------  -------  ------------  ----------------
# structural   jsmith      J. Smith            42,500
# structural     mlee        M. Lee            28,300
# drugdisc     kpatel      K. Patel            65,100
# drugdisc      achen       A. Chen            31,200
# genomics    rwilson     R. Wilson            98,400
```

### GPU Hours by Account

```bash
# GPU hours per account (critical for cryo-EM and ML)
sreport cluster AccountUtilizationByUser \
    Start=2026-01-01 End=2026-04-01 -t Hours \
    -T gres/gpu \
    Format=Account,Login,Used

# Account       Login   Proper Name   Used(GPU Hours)
# ----------  -------  ------------  ----------------
# structural   jsmith      J. Smith            12,800
# structural     mlee        M. Lee             8,200
# drugdisc     kpatel      K. Patel             4,500
# genomics    rwilson     R. Wilson               200
```

### Top Consumers

`TopUsage` is a user-level report (there is no account-level `TopUsage` type); derive top accounts from `AccountUtilizationByUser` totals:

```bash
# Top 20 users by usage
sreport user TopUsage Start=2026-01-01 End=2026-04-01 TopCount=20

# Per-account totals (sort to find the top accounts)
sreport cluster AccountUtilizationByUser \
    Start=2026-01-01 End=2026-04-01 -t Hours Format=Account,Used
```

### Monthly Trend Reports

Generate monthly usage for trend analysis:

```bash
for month in 01 02 03; do
    next=$(printf '%02d' $(( 10#$month + 1 )))  # zero-pad; 10# forces base 10
    echo "=== 2026-${month} ==="
    sreport cluster AccountUtilizationByUser \
        Start=2026-${month}-01 End=2026-${next}-01 -t Hours \
        Format=Account,Used
done
```

## Cost Models

### On-Premise: Amortized Cost per Core-Hour

For on-prem clusters, calculate the fully loaded cost per core-hour:

```
Annual cluster cost = Hardware amortization + Power + Cooling + Staff + Facilities
                    = $500,000 + $120,000 + $40,000 + $200,000 + $50,000
                    = $910,000/year

Available core-hours = Cores × Hours/year × Target utilization
                     = 2,048 cores × 8,760 hours × 0.75
                     = 13,455,360 core-hours

Cost per core-hour = $910,000 / 13,455,360 = $0.068
```

For GPU nodes, calculate separately:

```
GPU node annual cost = $80,000 amortization + $15,000 power
                     = $95,000/year per 4-GPU node

GPU-hours available = 4 GPUs × 8,760 hours × 0.70 = 24,528 GPU-hours

Cost per GPU-hour = $95,000 / 24,528 = $3.87
```

### Cloud: Direct Instance Cost

For ParallelCluster or PCS, the cost is straightforward -- EC2 instance pricing:

| Instance | On-Demand/hr | GPUs | CPU-hour Cost | GPU-hour Cost |
|----------|--------------|------|---------------|---------------|
| c5.4xlarge | $0.68 | - | $0.043 | - |
| r5.12xlarge | $3.024 | - | $0.063 | - |
| g5.12xlarge | $5.672 | 4x A10G | $0.118 | $1.42 |
| p4d.24xlarge | $32.77 | 8x A100 | $0.341 | $4.10 |

Add storage (FSx for Lustre, EFS), data transfer, and NAT gateway costs for a complete picture.

### Mapping Slurm Usage to Costs

```bash
# Export usage data for a billing period
sacct -S 2026-03-01 -E 2026-04-01 \
    --format=Account,User,Partition,AllocCPUS,AllocTRES,Elapsed,CPUTimeRAW \
    --parsable2 --noheader > usage_march_2026.csv
```

Then multiply by the cost-per-unit for each resource type:

```
Monthly charge = (CPU-hours × $CPU_rate)
              + (GPU-hours × $GPU_rate)
              + (Storage-GB × $storage_rate)
```

## Enforcing Budgets with QOS Limits

Slurm can enforce spending caps via QOS and association limits.

### Per-Account Limits

```bash
# Limit the structural biology group to 500 concurrent CPU cores
sacctmgr modify account structural set GrpTRES=cpu=500

# Limit the drug discovery group to 16 concurrent GPUs
sacctmgr modify account drugdisc set GrpTRES=gres/gpu=16

# Limit a single user to 5 concurrent jobs
sacctmgr modify user kpatel set MaxJobs=5
```

### QOS for Tiered Service

```bash
# Standard QOS (default)
sacctmgr add qos standard Priority=50 MaxTRESPerUser=cpu=128,gres/gpu=4

# Premium QOS (for time-sensitive work, higher priority)
sacctmgr add qos premium Priority=100 MaxTRESPerUser=cpu=512,gres/gpu=16

# Burst QOS (for large campaigns, lower priority)
sacctmgr add qos burst Priority=10 MaxSubmitJobsPerUser=1000
```

Assign QOS to accounts:

```bash
sacctmgr modify account structural set QOS=standard,premium
sacctmgr modify account drugdisc set QOS=standard,premium,burst
```

Users select a QOS at job submission:

```bash
sbatch --qos=premium --account=drugdisc urgent_fep.sh
```

## Cloud Cost Attribution

### ParallelCluster

Tag EC2 instances by Slurm account using a prolog script or custom launch template. Use AWS Cost Explorer with tags to break down spending by group.

### PCS

PCS compute node groups map to partitions. If you design one partition per department (e.g., `cryoem-gpu`, `drugdisc-cpu`), AWS Cost Explorer can attribute costs by instance group. For shared partitions, extract usage from `sacct` and multiply by the hourly instance cost for the period.

### Spot Instance Savings

Track Spot vs. On-Demand usage:

```bash
# Jobs on the spot partition
sacct -S 2026-03-01 --partition=spot-cpu \
    --format=Account,AllocCPUS,Elapsed,CPUTimeRAW --noheader
```

Report the savings: Spot typically costs 60-90% less than On-Demand.

## Sample Monthly Report

A monthly cost allocation report might include:

```
=== BioTeam HPC Usage Report: March 2026 ===

Cluster Utilization: 74.2% (target: 70-85%)
Total CPU-hours consumed: 265,500
Total GPU-hours consumed: 25,800

Usage by Account:
  Account      CPU-hours   GPU-hours   Est. Cost
  structural      70,800      21,000     $86,220
  drugdisc        96,300       4,500     $24,120
  genomics        98,400         300      $6,810

Queue Wait Times (average):
  cpu partition:  22 min
  gpu partition:  1.8 hrs
  highmem:        45 min

Recommendations:
- GPU queue wait times elevated; consider 2 additional A100 nodes
- Genomics group not using GPUs; CPU allocation adequate
- Drug discovery burst campaign completed; QOS limits can revert
```

## Related Modules

- [Accounts & Fairshare](../admin/04-accounts-fairshare.md) -- sacctmgr account setup
- [Capacity Planning](capacity-planning.md) -- utilization analysis
- [Monitoring & Accounting](../admin/06-monitoring-accounting.md) -- reporting tools
- [AWS ParallelCluster](../deployment/aws-parallelcluster.md) -- cloud cost context

## References

- [SchedMD: sreport](https://slurm.schedmd.com/sreport.html)
- [SchedMD: sacctmgr](https://slurm.schedmd.com/sacctmgr.html)
- [SchedMD: Accounting and Resource Limits](https://slurm.schedmd.com/accounting.html)
- [SchedMD: Quality of Service (QOS)](https://slurm.schedmd.com/qos.html)
- [SchedMD: Resource Limits](https://slurm.schedmd.com/resource_limits.html)
- [AWS: Cost Explorer](https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html)
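As a closing worked example of the "Mapping Slurm Usage to Costs" step, the sketch below aggregates a `sacct --parsable2` export into per-account charges. The field order matches the `--format` string used in that section; the heredoc records, the two rates, and the assumption that `Elapsed` is plain `HH:MM:SS` (no days component) are all illustrative:

```shell
#!/bin/sh
# Sketch: aggregate an `sacct --parsable2` export into per-account charges.
# Assumed field order (from the export's --format string):
#   1 Account | 2 User | 3 Partition | 4 AllocCPUS | 5 AllocTRES | 6 Elapsed | 7 CPUTimeRAW
# Rates are illustrative; Elapsed is assumed to be HH:MM:SS (no days field).
awk -F'|' -v cpu_rate=0.068 -v gpu_rate=3.87 '
  {
    cpu_h[$1] += $7 / 3600                  # CPUTimeRAW is CPU-seconds
    if (match($5, /gres\/gpu=[0-9]+/)) {    # GPU count from AllocTRES
      gpus = substr($5, RSTART + 9, RLENGTH - 9)
      split($6, t, ":")                     # Elapsed HH:MM:SS -> hours
      gpu_h[$1] += gpus * (t[1] + t[2]/60 + t[3]/3600)
    }
  }
  END {
    for (a in cpu_h)
      printf "%-12s %8.0f CPU-h %6.0f GPU-h  $%.2f\n",
             a, cpu_h[a], gpu_h[a], cpu_h[a]*cpu_rate + gpu_h[a]*gpu_rate
  }
' <<'EOF'
structural|jsmith|cpu|64|billing=64,cpu=64,mem=256G|12:00:00|2764800
drugdisc|kpatel|gpu|16|cpu=16,gres/gpu=4,mem=128G|06:00:00|345600
EOF
```

Pointing the same awk program at `usage_march_2026.csv` prices a real billing period; adding a storage term completes the monthly-charge formula.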