# Policies & Priority

## Multifactor Priority

Slurm's default priority plugin computes a composite score from multiple factors:

```
Job Priority =
    PriorityWeightAge       * AgeFactor +
    PriorityWeightFairshare * FairshareFactor +
    PriorityWeightJobSize   * JobSizeFactor +
    PriorityWeightPartition * PartitionFactor +
    PriorityWeightQOS       * QOSFactor +
    PriorityWeightAssoc     * AssocFactor +
    PriorityWeightTRES      * TRESFactors
    - nice
```

### Configuration

```ini
# slurm.conf
PriorityType=priority/multifactor
PriorityWeightFairshare=100000   # Dominant factor
PriorityWeightAge=1000           # Time in queue
PriorityWeightPartition=10000    # Partition importance
PriorityWeightJobSize=1000       # Favor larger jobs (helps utilization)
PriorityWeightQOS=10000          # QOS priority
PriorityDecayHalfLife=14-0       # Usage decays over 14 days
PriorityMaxAge=7-0               # Age factor maxes out after 7 days
```

### Viewing Priority

```bash
# See priority breakdown for pending jobs
$ sprio
  JOBID PARTITION     USER  PRIORITY   AGE  FAIRSHARE  JOBSIZE  PARTITION   QOS
  12345     batch     jdoe     85123   500      75000     1000       8000   623
  12346     batch   asmith     82500   200      73000     1000       8000   300

# Detailed view
$ sprio -l -j 12345

# See configured weights
$ sprio -w
```

---

## Scheduling Order

The scheduler evaluates jobs in this order:

1. Jobs that can **preempt** (if preemption is enabled)
2. Jobs with an **advanced reservation**
3. **Partition PriorityTier** (higher tiers evaluated first)
4. **Job priority** (the multifactor score)
5. **Job submit time** (FIFO tiebreaker)
6. **Job ID** (final tiebreaker)

This means a debug partition with `PriorityTier=200` is evaluated before batch with `PriorityTier=100`, regardless of individual job priorities.

---

## Preemption

Preemption allows high-priority jobs to displace running lower-priority jobs.
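At a high level, QOS-based preemption selects running jobs whose QOS appears in the pending job's `Preempt` list and that have run longer than `PreemptExemptTime`. A minimal sketch of that selection logic — a simplified model for intuition, not Slurm's actual implementation; the job fields and rule table are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    qos: str
    runtime_s: int  # seconds the job has been running

# Illustrative QOS preemption rules, mirroring the sacctmgr examples below
PREEMPT = {"high": {"normal", "low"}, "urgent": {"normal", "low", "high"}}
PREEMPT_EXEMPT_TIME_S = 300  # PreemptExemptTime=00:05:00

def preemption_candidates(pending_qos: str, running: list[Job]) -> list[Job]:
    """Running jobs that a pending job in `pending_qos` may displace:
    their QOS must be preemptable, and they must be past the exempt time."""
    preemptable = PREEMPT.get(pending_qos, set())
    return [j for j in running
            if j.qos in preemptable and j.runtime_s >= PREEMPT_EXEMPT_TIME_S]

running = [Job(101, "normal", 3600), Job(102, "low", 120), Job(103, "high", 900)]
print([j.job_id for j in preemption_candidates("high", running)])    # [101]
print([j.job_id for j in preemption_candidates("urgent", running)])  # [101, 103]
```

Note that job 102 survives both cases: although its QOS is preemptable, it has run for under five minutes and is protected by the exempt time.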
### Preemption Modes

| Mode | Action |
|------|--------|
| `CANCEL` | Kill the preempted job |
| `REQUEUE` | Requeue the preempted job (restarts from scratch) |
| `SUSPEND` | Suspend the preempted job (resume later) |
| `GANG` | Time-slice between preempting and preempted jobs |

### Configuration

```ini
# slurm.conf
PreemptType=preempt/qos       # QOS-based preemption
PreemptMode=REQUEUE           # Default action
PreemptExemptTime=00:05:00    # Jobs running <5 min are exempt
```

### QOS-Based Preemption

```bash
# high QOS can preempt normal and low
sacctmgr modify qos high set Preempt=normal,low PreemptMode=REQUEUE GraceTime=120

# urgent QOS can preempt everything
sacctmgr modify qos urgent set Preempt=normal,low,high PreemptMode=CANCEL
```

### Partition-Based Preemption

```ini
# slurm.conf
PreemptType=preempt/partition_prio
PartitionName=premium   Nodes=node[001-100] PriorityTier=200 PreemptMode=REQUEUE
PartitionName=standard  Nodes=node[001-100] PriorityTier=100
PartitionName=scavenger Nodes=node[001-100] PriorityTier=50
```

Jobs in `premium` preempt `standard` and `scavenger`. Jobs in `standard` preempt `scavenger`.

---

## Backfill Scheduling

The backfill scheduler is critical for cluster utilization. It allows lower-priority jobs to start on reserved resources if they will finish before the higher-priority job needs them.

```ini
# slurm.conf
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=5000,bf_interval=30,bf_resolution=600
```

| Parameter | Description | Default |
|-----------|-------------|---------|
| `bf_interval` | Seconds between backfill cycles | 30 |
| `bf_max_job_test` | Max jobs evaluated per cycle | 500 |
| `bf_resolution` | Time granularity for backfill (seconds) | 60 |

**Why walltime matters for users:** the backfill scheduler uses job time limits to predict resource availability. Jobs with accurate (shorter) time limits are more likely to be backfilled, which is why user training should emphasize setting reasonable `--time` values.
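The core backfill test reduces to one comparison: a lower-priority job may jump the queue only if its time limit expires before the blocked top-priority job's reserved start. A simplified sketch of that decision (real backfill also tracks per-node availability and `bf_resolution` granularity):

```python
def can_backfill(job_time_limit_min: int, now_min: int, reserved_start_min: int) -> bool:
    """A lower-priority job may backfill only if its time limit guarantees
    it finishes before the top-priority job's reserved start time."""
    return now_min + job_time_limit_min <= reserved_start_min

# Suppose the blocked top-priority job's nodes free up at t=240 min.
print(can_backfill(120, now_min=0, reserved_start_min=240))  # True: fits in the gap
print(can_backfill(480, now_min=0, reserved_start_min=240))  # False: would overrun
```

This is exactly why padded `--time` requests hurt: the same 2-hour computation submitted with a 480-minute limit is rejected for the gap that its honest 120-minute limit would have fit.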
---

## Reservations

Reserve specific resources for maintenance, special events, or guaranteed access.

### Create a Reservation

```bash
# Maintenance window
scontrol create reservation ReservationName=maint_apr15 \
    StartTime=2026-04-15T02:00:00 Duration=120 \
    Nodes=node[001-010] User=root Flags=MAINT,IGNORE_JOBS

# Reserved for a specific user/account
scontrol create reservation ReservationName=workshop \
    StartTime=2026-04-20T09:00:00 EndTime=2026-04-20T17:00:00 \
    NodeCnt=10 Accounts=training_account

# TRES-based reservation (e.g., reserve 200 GPUs)
scontrol create reservation ReservationName=gpu_block \
    StartTime=now Duration=480 \
    TRES=gres/gpu=200 User=jdoe
```

### Manage Reservations

```bash
# List reservations
scontrol show reservation

# Users submit to a reservation
sbatch --reservation=workshop my_job.sh

# Modify
scontrol update ReservationName=maint_apr15 Duration=180

# Delete
scontrol delete ReservationName=workshop
```

### Reservation Flags

| Flag | Purpose |
|------|---------|
| `MAINT` | Maintenance reservation (no user jobs) |
| `IGNORE_JOBS` | Allow the reservation to overlap running jobs |
| `FLEX` | Allow a job to use more or fewer nodes than reserved |
| `DAILY` / `WEEKLY` | Recurring reservation |

---

## Resource Quotas via QOS

Beyond the QOS limits covered in [Partitions & QOS](03-partitions-qos.md), common quota patterns:

### Limit Total Cluster Usage Per Account

```bash
sacctmgr modify account labA set GrpTRES=cpu=1024,gres/gpu=16
```

### Limit Per-User Concurrent Usage

```bash
sacctmgr modify qos normal set MaxTRESPerUser=cpu=256,gres/gpu=4 MaxJobsPerUser=50
```

### Limit Job Size

```bash
sacctmgr modify qos normal set MaxTRESPerJob=cpu=64,node=4
```

### Implement "Burst" Access

A "burst" QOS allows short periods of heavy usage:

```bash
sacctmgr add qos burst set \
    Priority=500 \
    MaxTRESPerUser=cpu=512 \
    MaxWallDurationPerJob=04:00:00 \
    UsageFactor=2.0              # Charges double for fairshare
```

---

## Policy Design Recommendations
1. **Start simple.** Fairshare with equal shares for all accounts is a good starting point. Add complexity only when needed.
2. **Fairshare is usually sufficient.** Don't add preemption unless users explicitly need it. Preemption adds complexity and can cause work loss.
3. **Use partition `PriorityTier` for evaluation order** (debug first, then interactive, then batch). Use QOS for priority within a tier.
4. **Monitor with `sprio` and `sshare`.** If users don't understand why their jobs are waiting, the policy is too complex.
5. **Document your policies.** Publish a clear page explaining how scheduling works, what the partitions/QOS mean, and how to check priority.

---

## References

- SchedMD: Multifactor Priority Plugin
- SchedMD: Preemption
- SchedMD: Scheduling Configuration Guide
- SchedMD: Advanced Resource Reservation Guide
- SchedMD: `sprio` man page
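As a closing worked example of why fairshare alone is usually enough: with `PriorityDecayHalfLife=14-0`, recorded usage halves every 14 days, so a burst of heavy usage stops dominating an account's priority within a few half-lives. A simplified model — the classic fair-share factor `2^(-usage/shares)` is an approximation of Slurm's actual algorithm, and the numbers are illustrative:

```python
def decayed_usage(usage: float, days_elapsed: float, half_life_days: float = 14.0) -> float:
    """Historical usage decays exponentially with the configured half-life
    (PriorityDecayHalfLife=14-0 in the configuration example above)."""
    return usage * 0.5 ** (days_elapsed / half_life_days)

def fairshare_factor(norm_usage: float, norm_shares: float) -> float:
    """Classic fair-share factor (simplified): 1.0 for an idle account,
    0.5 at exactly fair usage, approaching 0 as usage exceeds the share."""
    return 2.0 ** (-norm_usage / norm_shares)

u = 1000.0                             # CPU-hours charged to the account
print(decayed_usage(u, 14))            # 500.0 after one half-life
print(round(decayed_usage(u, 28), 1))  # 250.0 after two
print(fairshare_factor(0.0, 1.0))      # 1.0 for an idle account
print(fairshare_factor(1.0, 1.0))      # 0.5 at exactly fair usage
```

The self-correcting decay is what makes fairshare a gentle default: heavy users fall in priority, then recover automatically, with no preemption or manual intervention.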