Partitions & QOS
Exercises
- Create a new partition for long-running jobs
Add a long partition in slurm.conf that uses nodes cpu[001-064], allows a maximum walltime of 14 days, and has a lower PriorityTier than the default batch partition. Apply the change and verify the partition exists.
Hint / Solution
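One possible solution, sketched below. The node range and partition names come from the exercise; the `PriorityTier` values are assumptions — the only requirement is that `long` has a lower tier than the default `batch` partition.

```shell
# slurm.conf fragment (the batch line is shown only for context;
# PriorityTier=10 on batch is an assumed value):
#   PartitionName=batch Nodes=cpu[001-064] Default=YES PriorityTier=10
PartitionName=long Nodes=cpu[001-064] MaxTime=14-00:00:00 PriorityTier=5 State=UP

# Apply the change without restarting slurmctld
scontrol reconfigure

# Verify the partition exists and shows the expected limits
sinfo -p long
scontrol show partition long
```

`MaxTime` uses Slurm's `days-hours:minutes:seconds` format, so `14-00:00:00` is 14 days.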
- Create a QOS with specific resource limits
Create a QOS called restricted that limits each user to 32 CPUs and 2 GPUs across all their running jobs, and caps concurrent running jobs at 10 per user.
Hint / Solution
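A sketch of one way to do this with `sacctmgr`. It assumes your GPUs are tracked as the `gres/gpu` TRES (the usual configuration when `AccountingStorageTRES` includes `gres/gpu`).

```shell
# QOS "restricted": per-user caps of 32 CPUs and 2 GPUs across all
# running jobs, and at most 10 running jobs per user
sacctmgr add qos restricted set \
    MaxTRESPerUser=cpu=32,gres/gpu=2 \
    MaxJobsPerUser=10

# Verify the limits took effect
sacctmgr show qos restricted format=Name,MaxTRESPU,MaxJobsPU
```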
- Assign a QOS to an account
Allow all users in the smith_lab account to use both the normal and restricted QOS. Set normal as their default.
Hint / Solution
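A sketch using an account-level association, which child user associations inherit. It assumes the `smith_lab` account and both QOS already exist.

```shell
# Allow both QOS for everyone in smith_lab and make normal the default
sacctmgr modify account smith_lab set qos=normal,restricted defaultqos=normal

# Verify the association
sacctmgr show assoc account=smith_lab format=Account,User,QOS,DefaultQOS
```

Setting the QOS list on the account rather than per user means new users added to `smith_lab` pick up the same access automatically.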
- Test that a QOS limit prevents over-allocation
Create a test QOS called tiny that allows a maximum of 2 CPUs per user total. Assign it to a test user. Submit two single-CPU jobs (they should start), then submit a third and confirm it is held pending with a QOS limit reason.
Hint / Solution
```shell
# Create the restrictive QOS: at most 2 CPUs per user in total
sacctmgr add qos tiny set MaxTRESPerUser=cpu=2

# Assign it to a test user and make it their default
sacctmgr modify user testuser set qos=tiny defaultqos=tiny

# As testuser, submit three single-CPU jobs
su - testuser
sbatch --qos=tiny -n 1 --wrap="sleep 300"
sbatch --qos=tiny -n 1 --wrap="sleep 300"
sbatch --qos=tiny -n 1 --wrap="sleep 300"

# Check status -- the third job should be pending
squeue -u testuser -o "%.8i %.2t %.20R"
# Expected reason: (QOSMaxCpuPerUserLimit) or, on some versions, (QOSMaxTRESPerUser)

# Clean up: cancel the jobs, detach the QOS from the user
# (adjust qos/defaultqos to the user's usual values), then delete it
scancel -u testuser
sacctmgr modify user testuser set qos=normal defaultqos=normal
sacctmgr delete qos tiny
```