
Partitions & QOS

Exercises

  1. Create a new partition for long-running jobs

Add a long partition in slurm.conf that uses nodes cpu[001-064], allows a maximum walltime of 14 days, and has a lower PriorityTier than the default batch partition. Apply the change and verify the partition exists.

Hint / Solution
# Add to slurm.conf (PriorityTier=50 assumes batch is configured with a higher tier):
#   PartitionName=long Nodes=cpu[001-064] MaxTime=14-00:00:00 DefaultTime=04:00:00 PriorityTier=50 State=UP

# Apply
scontrol reconfigure

# Verify
scontrol show partition long
sinfo -p long
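
The exercise also asks for long to sit in a lower PriorityTier than batch, which the verification commands above do not compare directly. A minimal sketch of such a check follows. The helper names partition_tier and tier_is_lower are hypothetical, the /etc/slurm/slurm.conf path in the usage comment is an assumption, and the parsing assumes each partition is defined on a single line with the exact key spelling PartitionName= and PriorityTier=.

```shell
# Print the PriorityTier configured for a partition in a slurm.conf-style file.
# Falls back to 1 (Slurm's default) when the partition line has no PriorityTier.
# Returns non-zero if the partition is not defined in the file.
partition_tier() {
    local conf="$1" name="$2"
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }                      # skip comment lines
        $1 == ("PartitionName=" name) {
            tier = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^PriorityTier=/) { split($i, kv, "="); tier = kv[2] }
            print tier
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    ' "$conf"
}

# Succeed only if partition $2 has a strictly lower PriorityTier than partition $3.
tier_is_lower() {
    local conf="$1" low high
    low=$(partition_tier "$conf" "$2") || return 2
    high=$(partition_tier "$conf" "$3") || return 2
    [ "$low" -lt "$high" ]
}

# Usage (path is an assumption; adjust to your site):
#   tier_is_lower /etc/slurm/slurm.conf long batch && echo "long is below batch"
```

If batch has no explicit PriorityTier it runs at tier 1, and a value of 50 for long would then rank above it rather than below.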
  2. Create a QOS with specific resource limits

Create a QOS called restricted that limits each user to 32 CPUs and 2 GPUs across all their running jobs, and caps concurrent running jobs at 10 per user.

Hint / Solution
sacctmgr add qos restricted set \
    MaxTRESPerUser=cpu=32,gres/gpu=2 \
    MaxJobsPerUser=10 \
    Priority=0

# Verify
sacctmgr show qos restricted format=Name,MaxTRESPerUser,MaxJobsPerUser,Priority
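
To make the combined effect of these limits concrete, here is a toy model of the restricted QOS. It is only an illustration of the arithmetic, not how Slurm implements the check; the function name fits_restricted and its argument order are hypothetical.

```shell
# Per-user limits from the exercise.
MAX_CPUS=32
MAX_GPUS=2
MAX_JOBS=10

# Return 0 if a new job fits under every per-user limit, 1 otherwise.
#   $1 $2 $3: CPUs, GPUs, and job count the user already has running
#   $4 $5:    CPUs and GPUs requested by the new job
fits_restricted() {
    [ $(( $1 + $4 )) -le "$MAX_CPUS" ] &&
    [ $(( $2 + $5 )) -le "$MAX_GPUS" ] &&
    [ $(( $3 + 1 )) -le "$MAX_JOBS" ]
}

# 24 CPUs and 1 GPU in use across 3 jobs; an 8-CPU, 1-GPU job lands exactly on the limits.
if fits_restricted 24 1 3 8 1; then echo "starts"; else echo "pending"; fi

# The same user asking for 16 more CPUs would total 40, above the 32-CPU cap.
if fits_restricted 24 1 3 16 0; then echo "starts"; else echo "pending"; fi
```

The point of the model is that the limits apply to the sum over all of a user's running jobs, so a job that is small on its own can still wait.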
  3. Assign a QOS to an account

Allow all users in the smith_lab account to use both the normal and restricted QOS. Set normal as their default.

Hint / Solution
# Add the QOS to the account
sacctmgr modify account smith_lab set qos=normal,restricted defaultqos=normal

# Verify
sacctmgr show association where account=smith_lab format=Account,User,QOS,DefaultQOS
  4. Test that a QOS limit prevents over-allocation

Create a test QOS called tiny that allows a maximum of 2 CPUs per user in total. Assign it to a test user. Submit two single-CPU jobs (both should start), then submit a third and confirm that it remains pending with a QOS limit as the reason.

Hint / Solution
# Create the restrictive QOS
sacctmgr add qos tiny set MaxTRESPerUser=cpu=2 MaxJobsPerUser=10

# Assign to a test user
sacctmgr modify user testuser set qos=tiny defaultqos=tiny

# As testuser, submit jobs:
su - testuser
sbatch --qos=tiny --wrap="sleep 300" -n 1
sbatch --qos=tiny --wrap="sleep 300" -n 1
sbatch --qos=tiny --wrap="sleep 300" -n 1

# Check status -- the third job should be pending
# (%R is left without a width so that the reason is not truncated)
squeue -u testuser -o "%.8i %.2t %R"
# Expected reason: (QOSMaxCpuPerUserLimit) or (QOSMaxTRESPerUser)

# Clean up
scancel -u testuser
sacctmgr delete qos tiny
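
If the third job starts instead of pending, one possible cause is that the controller is not enforcing limits at all: QOS limits only take effect when AccountingStorageEnforce in slurm.conf includes limits (or safe, which implies it). The sketch below checks for that. The helper name limits_enforced is hypothetical, and the parsing assumes scontrol show config prints the setting as a single "AccountingStorageEnforce = value,value" line.

```shell
# Read "scontrol show config" output on stdin and succeed only if
# AccountingStorageEnforce contains limits, safe, or all.
limits_enforced() {
    awk -F'= *' '
        /^AccountingStorageEnforce/ {
            val = $2
            gsub(/[[:space:]]/, "", val)    # drop stray whitespace around the value
        }
        END { exit (val ~ /(^|,)(limits|safe|all)(,|$)/) ? 0 : 1 }
    '
}

# Usage on a cluster node:
#   scontrol show config | limits_enforced || echo "QOS limits are not being enforced"
```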

References