# Slurm Architecture

## Overview

Slurm is a cluster workload manager that allocates resources, launches jobs, and manages queues. Its architecture is modular, scalable, and fault-tolerant -- designed to manage clusters from tens to millions of cores.

---

## Core Daemons

| Daemon | Runs On | Purpose |
|--------|---------|---------|
| **slurmctld** | Management node(s) | Central controller: job scheduling, resource allocation, state management |
| **slurmd** | Every compute node | Node daemon: launches, monitors, and manages jobs on each node |
| **slurmdbd** | Database host | Accounting daemon: stores job records, associations, and limits in MySQL/MariaDB |
| **slurmrestd** | API host (optional) | REST API daemon for programmatic Slurm access |
| **sackd** | Login nodes (optional) | Authentication daemon for configless login/submit nodes |

### Architecture Diagram

```
                   ┌──────────────────────────────────────┐
                   │          Management Node(s)          │
Users ──────────►  │  slurmctld (primary)                 │
(login nodes)      │  slurmctld (backup, optional HA)     │
                   └──────────┬───────────────────┬───────┘
                              │                   │
                   ┌──────────▼──────┐  ┌─────────▼───────┐
                   │  Compute Nodes  │  │  Database Host  │
                   │  slurmd (each)  │  │  slurmdbd       │
                   │  slurmd (each)  │  │  MySQL/MariaDB  │
                   │  slurmd (each)  │  └─────────────────┘
                   │  ...            │
                   └─────────────────┘
```

---

## slurmctld: The Central Controller

The brain of Slurm. Responsibilities:

- Maintains cluster state (nodes, partitions, jobs)
- Evaluates job requests against available resources
- Runs the scheduling algorithm (backfill by default)
- Dispatches jobs to compute nodes
- Handles job state transitions (pending → running → completed)
- Writes state checkpoints for crash recovery

**High availability:** Configure a backup slurmctld on a second host. If the primary fails, the backup takes over automatically.

```
SlurmctldHost=mgmt01(10.0.0.1)   # Primary
SlurmctldHost=mgmt02(10.0.0.2)   # Backup
```

---

## slurmd: The Node Daemon

Runs on every compute node.
Responsibilities:

- Reports node resources (CPUs, memory, GPUs) to slurmctld
- Receives job launch requests from slurmctld
- Spawns job steps via `slurmstepd`
- Enforces resource limits (via cgroups)
- Reports job status and resource usage back to slurmctld

---

## slurmdbd: The Accounting Daemon

A dedicated daemon that mediates between slurmctld and the database:

- Stores job records, associations (accounts/users), fairshare data, and resource limits
- Buffers database writes for performance
- Supports multiple clusters sharing one database
- Backend: MySQL or MariaDB

**Why a separate daemon?** It decouples slurmctld from the database. If the database is slow or temporarily down, slurmctld continues to function and slurmdbd queues writes.

---

## Authentication

Slurm offers multiple authentication methods:

### Munge (Traditional Default)

[MUNGE](https://dun.github.io/munge/) provides symmetric-key authentication. Every node shares a common key (`/etc/munge/munge.key`), and the `munged` daemon handles credential creation and validation.
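A quick way to sanity-check a MUNGE installation on any node is to round-trip a credential locally (this assumes the MUNGE client tools are installed and `munged` is running):

```bash
# Encode a credential with the local key, then decode it.
# A decoded "STATUS: Success (0)" line confirms the key and daemon work.
munge -n | unmunge
```

Running the same test remotely (`munge -n | ssh other-node unmunge`) additionally verifies that both nodes share the same key.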
- **Pros:** Battle-tested, simple for on-prem clusters
- **Cons:** Requires the munge daemon on every node, plus key distribution management

### auth/slurm (Modern Alternative, 23.11+)

Slurm's internal authentication plugin using HMAC-SHA-256:

- Shared key file: `/etc/slurm/slurm.key`
- Simpler than Munge (no separate daemon)
- Supports LDAP-less operation via credential extensions
- Recommended for new deployments

### JWT (JSON Web Tokens)

Used as `AuthAltType` alongside Munge or auth/slurm:

- Primary use: REST API authentication via slurmrestd
- Supports external identity providers (AWS Cognito, Keycloak, Azure AD) via JWKS
- RS256 (asymmetric) keys for external providers, HS256 for internal use
- Not suitable for all operations (interactive `srun` is not supported with JWT)

### sackd: Authentication on Login Nodes

The `sackd` daemon runs on login/submit nodes to provide:

- **Configless operation:** Fetches `slurm.conf` from slurmctld (no local config files needed)
- Authentication services via a Unix socket
- TLS support for secure config retrieval
- JWKS support for auth/slurm

```bash
# Start sackd in configless mode
sackd --conf-server mgmt01:6817
```

---

## Communication Model

All Slurm daemons communicate over TCP:

| Communication Path | Default Port | Protocol |
|--------------------|--------------|----------|
| Client → slurmctld | 6817 | Authenticated RPC |
| slurmctld → slurmd | 6818 | Authenticated RPC |
| slurmctld → slurmdbd | 6819 | Authenticated RPC |
| Client → slurmrestd | 6820 (configurable) | HTTPS + JWT |

---

## State Management

slurmctld maintains state in memory and periodically checkpoints it to disk (`StateSaveLocation`). On restart:

- `-r` (default): Recovers jobs and DOWN/DRAIN node states
- `-R`: Full recovery including partition state and power-save settings
- `-c`: Clear all state (destructive -- rarely used in production)

---

## Plugin Architecture

Slurm's modularity comes from its plugin system.
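In practice, each plugin type is a single line in `slurm.conf`. A minimal illustrative excerpt (these values are examples only -- match them to your site's needs):

```
# Illustrative plugin selections in slurm.conf
AuthType=auth/munge
SelectType=select/cons_tres
SchedulerType=sched/backfill
PriorityType=priority/multifactor
AccountingStorageType=accounting_storage/slurmdbd
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
GresTypes=gpu
```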
Key plugin types:

| Plugin Type | Purpose | Common Values |
|-------------|---------|---------------|
| `AuthType` | Authentication | `auth/munge`, `auth/slurm` |
| `SelectType` | Resource selection | `select/cons_tres` (consumable TRES, recommended) |
| `SchedulerType` | Scheduling algorithm | `sched/backfill` (default, recommended) |
| `PriorityType` | Job prioritization | `priority/multifactor` (default) |
| `AccountingStorageType` | Accounting backend | `accounting_storage/slurmdbd` |
| `JobCompType` | Job completion logging | `jobcomp/filetxt`, `jobcomp/elasticsearch` |
| `TaskPlugin` | Task management | `task/cgroup`, `task/affinity` |
| `ProctrackType` | Process tracking | `proctrack/cgroup` |
| `GresTypes` | Generic resources | `gpu` (most common) |

---

## Deployment Variants

### On-Premise

You manage all daemons, the database, and configuration. Full control, full responsibility.

### AWS ParallelCluster

> **ParallelCluster Note:** The head node runs slurmctld, slurmdbd, and MariaDB on a single EC2 instance. There is no native HA for the controller -- if the head node is lost, it must be recreated. Size the head node instance appropriately (e.g., `c5.2xlarge` or larger for clusters >500 nodes).

ParallelCluster manages slurmctld, slurmdbd, and the database on the head node. It adds:

- Dynamic compute nodes that launch on demand (EC2 instances)
- A power-saving daemon that terminates idle nodes
- Custom slurm.conf parameters via `CustomSlurmSettings`
- Note: some slurm.conf parameters are deny-listed (managed by ParallelCluster)

### AWS PCS

> **PCS Note:** The Slurm control plane (slurmctld, slurmdbd, database) is fully managed by AWS and does not run in your account. You interact with it through the PCS API and console -- there is no SSH access to the controller.

A fully managed Slurm control plane. You configure it through the PCS API rather than editing slurm.conf directly.
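Whatever the deployment variant, a few standard Slurm commands verify that the daemons described above are wired together (run from any node with the Slurm CLI configured):

```bash
# Is slurmctld responding? Reports primary and backup controller status.
scontrol ping

# What partitions and node states does slurmctld currently see?
sinfo

# Is this cluster registered with slurmdbd for accounting?
sacctmgr show cluster
```

If `scontrol ping` succeeds but `sacctmgr` hangs or errors, the slurmctld ↔ slurmdbd link (port 6819 by default) is the first place to look.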
## References

- SchedMD: Slurm Overview
- SchedMD: Authentication Plugins
- SchedMD: JSON Web Tokens
- SchedMD: sackd man page
- SchedMD: "Configless" Slurm
- SchedMD: Quick Start Administrator Guide