Resource Management

Exercises

  1. Configure a GPU GRES

Add a new GPU node gpu09 with 2 NVIDIA A100 GPUs to the cluster. Define the GRES in both slurm.conf and gres.conf. Use AutoDetect=nvml in gres.conf.

Hint / Solution
# In slurm.conf, add the node definition:
#   NodeName=gpu09 CPUs=64 RealMemory=512000 Gres=gpu:a100:2

# In gres.conf (on gpu09 or globally):
#   AutoDetect=nvml
# Or explicitly:
#   Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-31
#   Name=gpu Type=a100 File=/dev/nvidia1 Cores=32-63

# Restart slurmctld (new node requires restart)
systemctl restart slurmctld

# Start slurmd on the new node
ssh gpu09 systemctl start slurmd
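Before putting the node into production, it can be worth confirming that slurmd's NVML autodetection matches the slurm.conf definition. A quick sketch, assuming the node and GRES names above (these commands only make sense on the cluster itself):

```shell
# Print the GRES that slurmd detects on the node (via NVML autodetect)
ssh gpu09 slurmd -G
# Expect one gpu line per device, type a100, for /dev/nvidia0 and /dev/nvidia1

# Then confirm a job can actually be scheduled onto both GPUs
srun -w gpu09 --gres=gpu:a100:2 nvidia-smi -L
```

If `slurmd -G` and slurm.conf disagree (wrong type or count), the node will be set to an invalid/drained state when slurmd registers.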
  2. Verify GRES with scontrol show node

Inspect a GPU node to confirm the GRES are correctly detected and available. Check both the configured GRES count and how many are currently allocated.

Hint / Solution
# Show full node details including GRES
scontrol show node gpu01

# Look for these fields in the output:
#   Gres=gpu:a100:4
#   CfgTRES=cpu=64,mem=512000M,billing=64,gres/gpu=4
#   AllocTRES=              (empty if no jobs running)
#   GresUsed=gpu:a100:0(IDX:)

# Check all GPU nodes at once
scontrol show nodes gpu[01-08] | grep -E "NodeName|Gres|AllocTRES"
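The CfgTRES/AllocTRES fields are also easy to post-process. As a sketch, the scontrol output is stubbed here with a shell variable so the parsing can run anywhere; on the cluster you would pipe `scontrol show nodes gpu[01-08]` in directly:

```shell
# Stub of the relevant scontrol lines (two nodes, 4 GPUs each)
scontrol_output='NodeName=gpu01 Arch=x86_64
   CfgTRES=cpu=64,mem=512000M,billing=64,gres/gpu=4
NodeName=gpu02 Arch=x86_64
   CfgTRES=cpu=64,mem=512000M,billing=64,gres/gpu=4'

# Sum the gres/gpu counts to get the total configured GPUs
echo "$scontrol_output" \
  | grep -o 'gres/gpu=[0-9]*' \
  | cut -d= -f2 \
  | awk '{s += $1} END {print "total GPUs:", s}'
# prints: total GPUs: 8
```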
  3. Add a license resource

Add 20 Schrodinger Glide licenses as a cluster-wide resource. Submit a test job requesting one license and verify that Slurm tracks the allocation.

Hint / Solution
# In slurm.conf, add or update the Licenses line:
#   Licenses=schrodinger_glide:20

scontrol reconfigure

# Verify the licenses are registered
scontrol show licenses
# Should show: LicenseName=schrodinger_glide Total=20 Used=0 Free=20

# Submit a test job requesting a license
sbatch --licenses=schrodinger_glide:1 --wrap="sleep 60"

# Check that the license is allocated
scontrol show licenses
# Used should now be 1, Free should be 19
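In a real workload the license request usually lives in the job script header rather than on the command line. A minimal sketch (the job name, time limit, CPU count, and the Glide invocation are placeholders, not part of the exercise):

```shell
#!/bin/bash
#SBATCH --job-name=glide_dock
#SBATCH --licenses=schrodinger_glide:1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# Placeholder command; substitute the actual Glide invocation.
# If all 20 licenses are in use, the job stays pending with Reason=Licenses.
"$SCHRODINGER/glide" dock.in
```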
  4. Test cgroup memory enforcement

Verify that ConstrainRAMSpace=yes is set in cgroup.conf. Submit a job that intentionally tries to allocate more memory than requested and confirm that Slurm kills it with an OOM exit code.

Hint / Solution
# Verify cgroup settings on a compute node
ssh cpu001 cat /etc/slurm/cgroup.conf | grep ConstrainRAMSpace
# Should show: ConstrainRAMSpace=yes

# Submit a job requesting 100 MB but trying to allocate 500 MB
sbatch --mem=100M --wrap="python3 -c \"x = bytearray(500 * 1024 * 1024); import time; time.sleep(60)\""

# Wait for it to be killed, then check
sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS
# State should be OUT_OF_MEMORY
# ExitCode typically shows the kill signal, e.g. 0:9 (SIGKILL from the OOM killer)
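The same test reads more clearly as a standalone job script than as a `--wrap` one-liner; a sketch (the file name is arbitrary):

```shell
# Write an equivalent job script requesting 100 MB but allocating ~500 MB
cat > oom_test.sh <<'EOF'
#!/bin/bash
#SBATCH --mem=100M
#SBATCH --time=00:05:00
# Allocates ~500 MB, well past the 100 MB cgroup limit
python3 -c "x = bytearray(500 * 1024 * 1024); import time; time.sleep(60)"
EOF

# On the cluster, submit it with: sbatch oom_test.sh
```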
  5. Verify node features and test constraint matching

Check which features are configured on the GPU nodes. Submit a job that requires the a100 feature using --constraint and verify it lands on an A100 node.

Hint / Solution
# Check features on GPU nodes
scontrol show nodes gpu[01-08] | grep -E "NodeName|AvailableFeatures|ActiveFeatures"

# Submit a job requiring a100 feature
sbatch --constraint=a100 --gres=gpu:1 --wrap="nvidia-smi; hostname" -o feature_test.out

# After completion, check which node it ran on
sacct -j <jobid> --format=JobID,NodeList,State
# Should show one of the A100 GPU nodes
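Constraints also compose into expressions, which becomes useful once several GPU generations coexist in one partition. A sketch (the `v100` and `nvlink` feature names are assumptions for illustration; only `a100` is confirmed above):

```shell
# OR: accept either GPU generation
sbatch --constraint="a100|v100" --gres=gpu:1 --wrap="hostname"

# AND: require both features on the same node
sbatch --constraint="a100&nvlink" --gres=gpu:1 --wrap="hostname"
```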

References