# Slurm Architecture

## Overview

Slurm is a cluster workload manager that allocates resources, launches jobs, and manages queues. Its architecture is modular, scalable, and fault-tolerant -- designed to manage clusters from tens to millions of cores.

---

## Core Daemons

| Daemon | Runs On | Purpose |
|--------|---------|---------|
| **slurmctld** | Management node(s) | Central controller: job scheduling, resource allocation, state management |
| **slurmd** | Every compute node | Node daemon: launches, monitors, and manages jobs on each node |
| **slurmdbd** | Database host | Accounting daemon: stores job records, associations, and limits in MySQL/MariaDB |
| **slurmrestd** | API host (optional) | REST API daemon for programmatic Slurm access |
| **sackd** | Login nodes (optional) | Authentication daemon for configless login/submit nodes |

### Architecture Diagram

```
                   ┌──────────────────────────────────────┐
                   │          Management Node(s)          │
Users ──────────►  │  slurmctld (primary)                 │
(login nodes)      │  slurmctld (backup, optional HA)     │
                   └──────────┬───────────────────┬───────┘
                              │                   │
                   ┌──────────▼──────┐  ┌─────────▼───────┐
                   │  Compute Nodes  │  │  Database Host  │
                   │  slurmd (each)  │  │  slurmdbd       │
                   │  slurmd (each)  │  │  MySQL/MariaDB  │
                   │  slurmd (each)  │  └─────────────────┘
                   │  ...            │
                   └─────────────────┘
```

---

## slurmctld: The Central Controller

The brain of Slurm. Responsibilities:

- Maintains cluster state (nodes, partitions, jobs)
- Evaluates job requests against available resources
- Runs the scheduling algorithm (backfill by default)
- Dispatches jobs to compute nodes
- Handles job state transitions (pending → running → completed)
- Writes state checkpoints for crash recovery

**High availability:** Configure a backup slurmctld on a second host. If the primary fails, the backup takes over automatically.

```
SlurmctldHost=mgmt01(10.0.0.1)   # Primary
SlurmctldHost=mgmt02(10.0.0.2)   # Backup
```

---

## slurmd: The Node Daemon

Runs on every compute node.
Responsibilities:

- Reports node resources (CPUs, memory, GPUs) to slurmctld
- Receives job launch requests from slurmctld
- Spawns job steps via `slurmstepd`
- Enforces resource limits (via cgroups)
- Reports job status and resource usage back to slurmctld

---

## slurmdbd: The Accounting Daemon

A dedicated daemon that mediates between slurmctld and the database:

- Stores job records, associations (accounts/users), fairshare data, and resource limits
- Buffers database writes for performance
- Supports multiple clusters sharing one database
- Backend: MySQL or MariaDB

**Why a separate daemon?** It decouples slurmctld from the database. If the database is slow or temporarily down, slurmctld continues to function and slurmdbd queues writes.

---

## Authentication

Slurm offers multiple authentication methods:

### Munge (Traditional Default)

[MUNGE](https://dun.github.io/munge/) provides symmetric-key authentication. Every node shares a common key (`/etc/munge/munge.key`), and the `munged` daemon handles credential creation and validation.
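A quick way to sanity-check a MUNGE installation on any node is to round-trip a credential locally (this assumes the MUNGE client tools are installed and `munged` is running):

```bash
# Encode a credential with the local key, then decode it.
# A decoded "STATUS: Success (0)" line confirms the key and daemon work.
munge -n | unmunge
```

Running the same test remotely (`munge -n | ssh other-node unmunge`) additionally verifies that both nodes share the same key.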
- **Pros:** Battle-tested, simple for on-prem clusters
- **Cons:** Requires the munge daemon on every node, plus key distribution management

### auth/slurm (Modern Alternative, 23.11+)

Slurm's internal authentication plugin using HMAC-SHA-256:

- Shared key file: `/etc/slurm/slurm.key`
- Simpler than Munge (no separate daemon)
- Supports LDAP-less operation via credential extensions
- Recommended for new deployments

### JWT (JSON Web Tokens)

Used as `AuthAltType` alongside Munge or auth/slurm:

- Primary use: REST API authentication via slurmrestd
- Supports external identity providers (AWS Cognito, Keycloak, Azure AD) via JWKS
- RS256 (asymmetric) keys for external providers, HS256 for internal use
- Not suitable for all operations (interactive `srun` is not supported with JWT)

### sackd: Authentication on Login Nodes

The `sackd` daemon runs on login/submit nodes to provide:

- **Configless operation:** Fetches `slurm.conf` from slurmctld (no local config files needed)
- Authentication services via a Unix socket
- TLS support for secure config retrieval
- JWKS support for auth/slurm

```bash
# Start sackd in configless mode
sackd --conf-server mgmt01:6817
```

---

## Communication Model

All Slurm daemons communicate over TCP:

| Communication Path | Default Port | Protocol |
|--------------------|--------------|----------|
| Client → slurmctld | 6817 | Authenticated RPC |
| slurmctld → slurmd | 6818 | Authenticated RPC |
| slurmctld → slurmdbd | 6819 | Authenticated RPC |
| Client → slurmrestd | 6820 (configurable) | HTTPS + JWT |

---

## State Management

slurmctld maintains state in memory and periodically checkpoints it to disk (`StateSaveLocation`). On restart:

- `-r` (default): Recovers jobs and DOWN/DRAIN node states
- `-R`: Full recovery including partition state and power-save settings
- `-c`: Clear all state (destructive -- rarely used in production)

---

## Plugin Architecture

Slurm's modularity comes from its plugin system.
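In practice, each plugin type is a single line in `slurm.conf`. A minimal illustrative excerpt (these values are examples only -- match them to your site's needs):

```
# Illustrative plugin selections in slurm.conf
AuthType=auth/munge
SelectType=select/cons_tres
SchedulerType=sched/backfill
PriorityType=priority/multifactor
AccountingStorageType=accounting_storage/slurmdbd
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
GresTypes=gpu
```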
Key plugin types:

| Plugin Type | Purpose | Common Values |
|-------------|---------|---------------|
| `AuthType` | Authentication | `auth/munge`, `auth/slurm` |
| `SelectType` | Resource selection | `select/cons_tres` (consumable TRES, recommended) |
| `SchedulerType` | Scheduling algorithm | `sched/backfill` (default, recommended) |
| `PriorityType` | Job prioritization | `priority/multifactor` (default) |
| `AccountingStorageType` | Accounting backend | `accounting_storage/slurmdbd` |
| `JobCompType` | Job completion logging | `jobcomp/filetxt`, `jobcomp/elasticsearch` |
| `TaskPlugin` | Task management | `task/cgroup`, `task/affinity` |
| `ProctrackType` | Process tracking | `proctrack/cgroup` |
| `GresTypes` | Generic resources | `gpu` (most common) |

---

## Deployment Variants

### On-Premise

You manage all daemons, the database, and configuration. Full control, full responsibility.

### AWS ParallelCluster

> **ParallelCluster Note:** The head node runs slurmctld, slurmdbd, and MariaDB on a single EC2 instance. There is no native HA for the controller -- if the head node is lost, it must be recreated. Size the head node instance appropriately (e.g., `c5.2xlarge` or larger for clusters >500 nodes).

ParallelCluster manages slurmctld, slurmdbd, and the database on the head node. It adds:

- Dynamic compute nodes that launch on demand (EC2 instances)
- A power-saving daemon that terminates idle nodes
- Custom slurm.conf parameters via `CustomSlurmSettings`
- Note: some slurm.conf parameters are deny-listed (managed by ParallelCluster)

### AWS PCS

> **PCS Note:** The Slurm control plane (slurmctld, slurmdbd, database) is fully managed by AWS and does not run in your account. You interact with it through the PCS API and console -- there is no SSH access to the controller.

A fully managed Slurm control plane. You configure it through the PCS API rather than editing slurm.conf directly.
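Whatever the deployment variant, a few standard Slurm commands verify that the daemons described above are wired together (run from any node with the Slurm CLI configured):

```bash
# Is slurmctld responding? Reports primary and backup controller status.
scontrol ping

# What partitions and node states does slurmctld currently see?
sinfo

# Is this cluster registered with slurmdbd for accounting?
sacctmgr show cluster
```

If `scontrol ping` succeeds but `sacctmgr` hangs or errors, the slurmctld ↔ slurmdbd link (port 6819 by default) is the first place to look.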
## References

- SchedMD: Slurm Overview
- SchedMD: Authentication Plugins
- SchedMD: JSON Web Tokens
- SchedMD: sackd man page
- SchedMD: "Configless" Slurm
- SchedMD: Quick Start Administrator Guide