# Policies & Priority

## Multifactor Priority

Slurm's default priority plugin computes a composite score from multiple factors:

```
Job Priority =
    PriorityWeightAge       * AgeFactor +
    PriorityWeightFairshare * FairshareFactor +
    PriorityWeightJobSize   * JobSizeFactor +
    PriorityWeightPartition * PartitionFactor +
    PriorityWeightQOS       * QOSFactor +
    PriorityWeightAssoc     * AssocFactor +
    PriorityWeightTRES      * TRESFactors
    - nice
```

### Configuration

```ini
# slurm.conf
PriorityType=priority/multifactor
PriorityWeightFairshare=100000   # Dominant factor
PriorityWeightAge=1000           # Time in queue
PriorityWeightPartition=10000    # Partition importance
PriorityWeightJobSize=1000       # Favor larger jobs (helps utilization)
PriorityWeightQOS=10000          # QOS priority
PriorityDecayHalfLife=14-0       # Usage decays over 14 days
PriorityMaxAge=7-0               # Age factor maxes out after 7 days
```

### Viewing Priority

```bash
# See priority breakdown for pending jobs
$ sprio
  JOBID PARTITION     USER  PRIORITY   AGE  FAIRSHARE  JOBSIZE  PARTITION   QOS
  12345     batch     jdoe     85123   500      75000     1000       8000   623
  12346     batch   asmith     82500   200      73000     1000       8000   300

# Detailed view
$ sprio -l -j 12345

# See configured weights
$ sprio -w
```

---

## Scheduling Order

The scheduler evaluates jobs in this order:

1. Jobs that can **preempt** (if preemption is enabled)
2. Jobs with an **advanced reservation**
3. **Partition PriorityTier** (higher tiers evaluated first)
4. **Job priority** (the multifactor score)
5. **Job submit time** (FIFO tiebreaker)
6. **Job ID** (final tiebreaker)

This means a debug partition with `PriorityTier=200` is evaluated before batch with `PriorityTier=100`, regardless of individual job priorities.

---

## Preemption

Preemption allows high-priority jobs to displace running lower-priority jobs.
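At a high level, QOS-based preemption selects running jobs whose QOS appears in the pending job's `Preempt` list and that have run longer than `PreemptExemptTime`. A minimal sketch of that selection logic — a simplified model for intuition, not Slurm's actual implementation; the job fields and rule table are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    qos: str
    runtime_s: int  # seconds the job has been running

# Illustrative QOS preemption rules, mirroring the sacctmgr examples below
PREEMPT = {"high": {"normal", "low"}, "urgent": {"normal", "low", "high"}}
PREEMPT_EXEMPT_TIME_S = 300  # PreemptExemptTime=00:05:00

def preemption_candidates(pending_qos: str, running: list[Job]) -> list[Job]:
    """Running jobs that a pending job in `pending_qos` may displace:
    their QOS must be preemptable, and they must be past the exempt time."""
    preemptable = PREEMPT.get(pending_qos, set())
    return [j for j in running
            if j.qos in preemptable and j.runtime_s >= PREEMPT_EXEMPT_TIME_S]

running = [Job(101, "normal", 3600), Job(102, "low", 120), Job(103, "high", 900)]
print([j.job_id for j in preemption_candidates("high", running)])    # [101]
print([j.job_id for j in preemption_candidates("urgent", running)])  # [101, 103]
```

Note that job 102 survives both cases: although its QOS is preemptable, it has run for under five minutes and is protected by the exempt time.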
### Preemption Modes

| Mode | Action |
|------|--------|
| `CANCEL` | Kill the preempted job |
| `REQUEUE` | Requeue the preempted job (restarts from scratch) |
| `SUSPEND` | Suspend the preempted job (resume later) |
| `GANG` | Time-slice between preempting and preempted jobs |

### Configuration

```ini
# slurm.conf
PreemptType=preempt/qos       # QOS-based preemption
PreemptMode=REQUEUE           # Default action
PreemptExemptTime=00:05:00    # Jobs running <5 min are exempt
```

### QOS-Based Preemption

```bash
# high QOS can preempt normal and low
sacctmgr modify qos high set Preempt=normal,low PreemptMode=REQUEUE GraceTime=120

# urgent QOS can preempt everything
sacctmgr modify qos urgent set Preempt=normal,low,high PreemptMode=CANCEL
```

### Partition-Based Preemption

```ini
# slurm.conf
PreemptType=preempt/partition_prio
PartitionName=premium   Nodes=node[001-100] PriorityTier=200 PreemptMode=REQUEUE
PartitionName=standard  Nodes=node[001-100] PriorityTier=100
PartitionName=scavenger Nodes=node[001-100] PriorityTier=50
```

Jobs in `premium` preempt `standard` and `scavenger`. Jobs in `standard` preempt `scavenger`.

---

## Backfill Scheduling

The backfill scheduler is critical for cluster utilization. It allows lower-priority jobs to start on reserved resources if they will finish before the higher-priority job needs them.

```ini
# slurm.conf
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=5000,bf_interval=30,bf_resolution=600
```

| Parameter | Description | Default |
|-----------|-------------|---------|
| `bf_interval` | Seconds between backfill cycles | 30 |
| `bf_max_job_test` | Max jobs evaluated per cycle | 500 |
| `bf_resolution` | Time granularity for backfill (seconds) | 60 |

**Why walltime matters for users:** the backfill scheduler uses job time limits to predict resource availability. Jobs with accurate (shorter) time limits are more likely to be backfilled, which is why user training should emphasize setting reasonable `--time` values.
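The core backfill test reduces to one comparison: a lower-priority job may jump the queue only if its time limit expires before the blocked top-priority job's reserved start. A simplified sketch of that decision (real backfill also tracks per-node availability and `bf_resolution` granularity):

```python
def can_backfill(job_time_limit_min: int, now_min: int, reserved_start_min: int) -> bool:
    """A lower-priority job may backfill only if its time limit guarantees
    it finishes before the top-priority job's reserved start time."""
    return now_min + job_time_limit_min <= reserved_start_min

# Suppose the blocked top-priority job's nodes free up at t=240 min.
print(can_backfill(120, now_min=0, reserved_start_min=240))  # True: fits in the gap
print(can_backfill(480, now_min=0, reserved_start_min=240))  # False: would overrun
```

This is exactly why padded `--time` requests hurt: the same 2-hour computation submitted with a 480-minute limit is rejected for the gap that its honest 120-minute limit would have fit.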
---

## Reservations

Reserve specific resources for maintenance, special events, or guaranteed access.

### Create a Reservation

```bash
# Maintenance window
scontrol create reservation ReservationName=maint_apr15 \
    StartTime=2026-04-15T02:00:00 Duration=120 \
    Nodes=node[001-010] User=root Flags=MAINT,IGNORE_JOBS

# Reserved for a specific user/account
scontrol create reservation ReservationName=workshop \
    StartTime=2026-04-20T09:00:00 EndTime=2026-04-20T17:00:00 \
    NodeCnt=10 Accounts=training_account

# TRES-based reservation (e.g., reserve 200 GPUs)
scontrol create reservation ReservationName=gpu_block \
    StartTime=now Duration=480 \
    TRES=gres/gpu=200 User=jdoe
```

### Manage Reservations

```bash
# List reservations
scontrol show reservation

# Users submit to a reservation
sbatch --reservation=workshop my_job.sh

# Modify
scontrol update ReservationName=maint_apr15 Duration=180

# Delete
scontrol delete ReservationName=workshop
```

### Reservation Flags

| Flag | Purpose |
|------|---------|
| `MAINT` | Maintenance reservation (no user jobs) |
| `IGNORE_JOBS` | Allow the reservation to overlap running jobs |
| `FLEX` | Allow a job to use more or fewer nodes than reserved |
| `DAILY` / `WEEKLY` | Recurring reservation |

---

## Resource Quotas via QOS

Beyond the QOS limits covered in [Partitions & QOS](03-partitions-qos.md), common quota patterns:

### Limit Total Cluster Usage Per Account

```bash
sacctmgr modify account labA set GrpTRES=cpu=1024,gres/gpu=16
```

### Limit Per-User Concurrent Usage

```bash
sacctmgr modify qos normal set MaxTRESPerUser=cpu=256,gres/gpu=4 MaxJobsPerUser=50
```

### Limit Job Size

```bash
sacctmgr modify qos normal set MaxTRESPerJob=cpu=64,node=4
```

### Implement "Burst" Access

A "burst" QOS allows short periods of heavy usage:

```bash
sacctmgr add qos burst set \
    Priority=500 \
    MaxTRESPerUser=cpu=512 \
    MaxWallDurationPerJob=04:00:00 \
    UsageFactor=2.0              # Charges double for fairshare
```

---

## Policy Design Recommendations
1. **Start simple.** Fairshare with equal shares for all accounts is a good starting point. Add complexity only when needed.
2. **Fairshare is usually sufficient.** Don't add preemption unless users explicitly need it. Preemption adds complexity and can cause work loss.
3. **Use partition `PriorityTier` for evaluation order** (debug first, then interactive, then batch). Use QOS for priority within a tier.
4. **Monitor with `sprio` and `sshare`.** If users don't understand why their jobs are waiting, the policy is too complex.
5. **Document your policies.** Publish a clear page explaining how scheduling works, what the partitions/QOS mean, and how to check priority.

---

## References

- SchedMD: Multifactor Priority Plugin
- SchedMD: Preemption
- SchedMD: Scheduling Configuration Guide
- SchedMD: Advanced Resource Reservation Guide
- SchedMD: `sprio` man page
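As a closing worked example of why fairshare alone is usually enough: with `PriorityDecayHalfLife=14-0`, recorded usage halves every 14 days, so a burst of heavy usage stops dominating an account's priority within a few half-lives. A simplified model — the classic fair-share factor `2^(-usage/shares)` is an approximation of Slurm's actual algorithm, and the numbers are illustrative:

```python
def decayed_usage(usage: float, days_elapsed: float, half_life_days: float = 14.0) -> float:
    """Historical usage decays exponentially with the configured half-life
    (PriorityDecayHalfLife=14-0 in the configuration example above)."""
    return usage * 0.5 ** (days_elapsed / half_life_days)

def fairshare_factor(norm_usage: float, norm_shares: float) -> float:
    """Classic fair-share factor (simplified): 1.0 for an idle account,
    0.5 at exactly fair usage, approaching 0 as usage exceeds the share."""
    return 2.0 ** (-norm_usage / norm_shares)

u = 1000.0                             # CPU-hours charged to the account
print(decayed_usage(u, 14))            # 500.0 after one half-life
print(round(decayed_usage(u, 28), 1))  # 250.0 after two
print(fairshare_factor(0.0, 1.0))      # 1.0 for an idle account
print(fairshare_factor(1.0, 1.0))      # 0.5 at exactly fair usage
```

The self-correcting decay is what makes fairshare a gentle default: heavy users fall in priority, then recover automatically, with no preemption or manual intervention.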