Resource Management

Exercises

  1. Configure a GPU GRES

Add a new GPU node gpu09 with 2 NVIDIA A100 GPUs to the cluster. Define the GRES in both slurm.conf and gres.conf. Use AutoDetect=nvml in gres.conf.

Hint / Solution
# In slurm.conf, add the node definition:
#   NodeName=gpu09 CPUs=64 RealMemory=512000 Gres=gpu:a100:2

# In gres.conf (on gpu09 or globally):
#   AutoDetect=nvml
# Or explicitly:
#   Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-31
#   Name=gpu Type=a100 File=/dev/nvidia1 Cores=32-63

# Restart slurmctld (new node requires restart)
systemctl restart slurmctld

# Start slurmd on the new node
ssh gpu09 systemctl start slurmd
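Before putting the node into production, it can be worth confirming that slurmd's NVML autodetection matches the slurm.conf definition. A quick sketch, assuming the node and GRES names above (these commands only make sense on the cluster itself):

```shell
# Print the GRES that slurmd detects on the node (via NVML autodetect)
ssh gpu09 slurmd -G
# Expect one gpu line per device, type a100, for /dev/nvidia0 and /dev/nvidia1

# Then confirm a job can actually be scheduled onto both GPUs
srun -w gpu09 --gres=gpu:a100:2 nvidia-smi -L
```

If `slurmd -G` and slurm.conf disagree (wrong type or count), the node will be set to an invalid/drained state when slurmd registers.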
  2. Verify GRES with scontrol show node

Inspect a GPU node to confirm the GRES are correctly detected and available. Check both the configured GRES count and how many are currently allocated.

Hint / Solution
# Show full node details including GRES
scontrol show node gpu01

# Look for these fields in the output:
#   Gres=gpu:a100:4
#   CfgTRES=cpu=64,mem=512000M,billing=64,gres/gpu=4
#   AllocTRES=              (empty if no jobs running)
#   GresUsed=gpu:a100:0(IDX:)

# Check all GPU nodes at once
scontrol show nodes gpu[01-08] | grep -E "NodeName|Gres|AllocTRES"
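The CfgTRES/AllocTRES fields are also easy to post-process. As a sketch, the scontrol output is stubbed here with a shell variable so the parsing can run anywhere; on the cluster you would pipe `scontrol show nodes gpu[01-08]` in directly:

```shell
# Stub of the relevant scontrol lines (two nodes, 4 GPUs each)
scontrol_output='NodeName=gpu01 Arch=x86_64
   CfgTRES=cpu=64,mem=512000M,billing=64,gres/gpu=4
NodeName=gpu02 Arch=x86_64
   CfgTRES=cpu=64,mem=512000M,billing=64,gres/gpu=4'

# Sum the gres/gpu counts to get the total configured GPUs
echo "$scontrol_output" \
  | grep -o 'gres/gpu=[0-9]*' \
  | cut -d= -f2 \
  | awk '{s += $1} END {print "total GPUs:", s}'
# prints: total GPUs: 8
```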
  3. Add a license resource

Add 20 Schrodinger Glide licenses as a cluster-wide resource. Submit a test job requesting one license and verify that Slurm tracks the allocation.

Hint / Solution
# In slurm.conf, add or update the Licenses line:
#   Licenses=schrodinger_glide:20

scontrol reconfigure

# Verify the licenses are registered
scontrol show licenses
# Should show: LicenseName=schrodinger_glide Total=20 Used=0 Free=20

# Submit a test job requesting a license
sbatch --licenses=schrodinger_glide:1 --wrap="sleep 60"

# Check that the license is allocated
scontrol show licenses
# Used should now be 1, Free should be 19
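In a real workload the license request usually lives in the job script header rather than on the command line. A minimal sketch (the job name, time limit, CPU count, and the Glide invocation are placeholders, not part of the exercise):

```shell
#!/bin/bash
#SBATCH --job-name=glide_dock
#SBATCH --licenses=schrodinger_glide:1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# Placeholder command; substitute the actual Glide invocation.
# If all 20 licenses are in use, the job stays pending with Reason=Licenses.
"$SCHRODINGER/glide" dock.in
```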
  4. Test cgroup memory enforcement

Verify that ConstrainRAMSpace=yes is set in cgroup.conf. Submit a job that intentionally tries to allocate more memory than requested and confirm that Slurm kills it with an OOM exit code.

Hint / Solution
# Verify cgroup settings on a compute node
ssh cpu001 cat /etc/slurm/cgroup.conf | grep ConstrainRAMSpace
# Should show: ConstrainRAMSpace=yes

# Submit a job requesting 100 MB but trying to allocate 500 MB
sbatch --mem=100M --wrap="python3 -c \"x = bytearray(500 * 1024 * 1024); import time; time.sleep(60)\""

# Wait for it to be killed, then check
sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS
# State should be OUT_OF_MEMORY
# ExitCode typically shows the kill signal, e.g. 0:9 (SIGKILL from the OOM killer)
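The same test reads more clearly as a standalone job script than as a `--wrap` one-liner; a sketch (the file name is arbitrary):

```shell
# Write an equivalent job script requesting 100 MB but allocating ~500 MB
cat > oom_test.sh <<'EOF'
#!/bin/bash
#SBATCH --mem=100M
#SBATCH --time=00:05:00
# Allocates ~500 MB, well past the 100 MB cgroup limit
python3 -c "x = bytearray(500 * 1024 * 1024); import time; time.sleep(60)"
EOF

# On the cluster, submit it with: sbatch oom_test.sh
```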
  5. Verify node features and test constraint matching

Check which features are configured on the GPU nodes. Submit a job that requires the a100 feature using --constraint and verify it lands on an A100 node.

Hint / Solution
# Check features on GPU nodes
scontrol show nodes gpu[01-08] | grep -E "NodeName|AvailableFeatures|ActiveFeatures"

# Submit a job requiring a100 feature
sbatch --constraint=a100 --gres=gpu:1 --wrap="nvidia-smi; hostname" -o feature_test.out

# After completion, check which node it ran on
sacct -j <jobid> --format=JobID,NodeList,State
# Should show one of the A100 GPU nodes
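Constraints also compose into expressions, which becomes useful once several GPU generations coexist in one partition. A sketch (the `v100` and `nvlink` feature names are assumptions for illustration; only `a100` is confirmed above):

```shell
# OR: accept either GPU generation
sbatch --constraint="a100|v100" --gres=gpu:1 --wrap="hostname"

# AND: require both features on the same node
sbatch --constraint="a100&nvlink" --gres=gpu:1 --wrap="hostname"
```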

References