# High Availability

## Overview

Production Slurm clusters need resilience against failures. Slurm provides HA mechanisms for its critical components:

- **slurmctld** -- primary/backup failover
- **slurmdbd** -- database HA via MySQL replication or clustering
- **State persistence** -- checkpoint/recovery for crash survival

---

## slurmctld High Availability

### Backup Controller

Configure a second host as backup slurmctld:

```ini
# slurm.conf
SlurmctldHost=mgmt01(10.0.0.1)        # Primary
SlurmctldHost=mgmt02(10.0.0.2)        # Backup

StateSaveLocation=/shared/slurmctld   # Must be accessible to both hosts
SlurmctldTimeout=120                  # Seconds before backup takes over
```

Both hosts run slurmctld. The backup operates in standby mode, monitoring the primary via heartbeat. If the primary fails to respond within `SlurmctldTimeout`, the backup assumes the role.

### Requirements

- **Shared state directory:** `StateSaveLocation` must be on a filesystem accessible to both slurmctld hosts (NFS, shared block storage, DRBD)
- **Network:** Both hosts must be reachable by slurmd on all compute nodes
- **Configuration:** Identical `slurm.conf` on both hosts

### Failover Behavior

1. The primary slurmctld stops responding
2. After `SlurmctldTimeout` seconds, the backup detects the failure
3. The backup reads the latest state checkpoint from `StateSaveLocation`
4. The backup becomes the active controller
5. Compute nodes (slurmd) automatically redirect communication to the backup
6. Running jobs continue unaffected
7. Pending jobs resume scheduling
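The failover sequence above can be watched externally, e.g. from cron or a monitoring agent. A minimal sketch follows; the `scontrol ping` output pattern it greps for is an assumption (the format varies across Slurm versions), so adapt it to your cluster:

```bash
#!/bin/bash
# failover_alert.sh -- warn when the backup slurmctld is the active
# controller, i.e., a failover has occurred.
# ASSUMPTION: `scontrol ping` prints lines like
#   "Slurmctld(backup) at mgmt02 is UP"
# Adjust the pattern if your Slurm version formats this differently.

backup_is_active() {
    # $1 = captured `scontrol ping` output
    echo "$1" | grep -qE 'Slurmctld\(backup\).*is UP'
}

ping_out=$(scontrol ping 2>/dev/null || true)
if backup_is_active "$ping_out"; then
    echo "WARNING: backup slurmctld is active -- check the primary"
    exit 1
fi
```

Hooked into cron, a non-zero exit (or the WARNING line) can feed whatever alerting your site already uses.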
### Manual Failover

```bash
# On the primary, gracefully step down
systemctl stop slurmctld

# The backup takes over automatically. Verify:
scontrol ping
# Slurmctld(primary) at mgmt01 is DOWN
# Slurmctld(backup) at mgmt02 is UP
```

Alternatively, run `scontrol takeover` to instruct the backup controller to assume control.

### Failback

When the primary is repaired:

```bash
# Start slurmctld on the primary
systemctl start slurmctld

# The primary resumes control automatically: the backup saves state and
# returns to standby mode, and the primary reads the latest state.
```

No manual failback step is needed; the primary reasserts control when it restarts.

---

## slurmdbd High Availability

slurmdbd depends on MySQL/MariaDB. HA options:

### Option 1: MySQL Replication

- Primary MySQL for writes
- Read replica for failover
- `AccountingStorageBackupHost` in slurm.conf names a backup slurmdbd instance (which can run on the replica host)

```ini
# slurm.conf
AccountingStorageHost=db01
AccountingStorageBackupHost=db02
```

### Option 2: MySQL/MariaDB Galera Cluster

Multi-master synchronous replication:

- All nodes can accept writes
- Point slurmdbd at a load balancer or virtual IP
- True HA with no manual failover

### Option 3: slurmdbd Buffering

slurmdbd is not strictly required for Slurm to function. If slurmdbd is down:

- slurmctld **buffers accounting data** in memory
- Jobs continue to run and be scheduled
- When slurmdbd comes back, the buffered data is flushed
- Fair-share data may be slightly stale during the outage

This built-in resilience means a brief slurmdbd outage doesn't impact operations.

---

## State Directory Considerations

The `StateSaveLocation` directory is critical:

- **Fast storage:** SSD or a high-performance NFS mount
- **Reliable:** RAID, replicated, or on a highly available filesystem
- **Not too large:** typically a few hundred MB to a few GB
- **Backed up regularly**

```bash
# Monitor state directory size
du -sh /var/spool/slurmctld/

# Regular backup
rsync -a /var/spool/slurmctld/ /backup/slurmctld_state/
```
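The size check and the rsync backup can be combined into one cron-able script. A sketch, where the paths and the `MAX_MB` threshold are illustrative, not Slurm defaults:

```bash
#!/bin/bash
# backup_state.sh -- snapshot the slurmctld state directory and warn if it
# grows past a threshold. Paths and MAX_MB are illustrative; adjust per site.

MAX_MB=4096   # state directories are typically well under a few GB

state_size_mb() {
    # Print a directory's size in MB (empty output if it does not exist)
    du -sm "$1" 2>/dev/null | awk '{print $1}'
}

backup_state() {
    # $1 = state directory, $2 = backup target
    local size
    size=$(state_size_mb "$1")
    if [ "${size:-0}" -gt "$MAX_MB" ]; then
        echo "WARNING: $1 is ${size}MB (threshold ${MAX_MB}MB)"
    fi
    mkdir -p "$2" && rsync -a "$1/" "$2/"
}

# Example: run from cron on the controller host
# backup_state /var/spool/slurmctld /backup/slurmctld_state
```

An unexpectedly large state directory often means stale job scripts or checkpoint files accumulating, which is worth investigating before it slows failover.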
### Local vs. Shared State

| Approach | Pros | Cons |
|----------|------|------|
| Shared NFS | Simple HA failover | NFS is a dependency |
| Local + DRBD | No NFS dependency | More complex setup |
| Local only | Simplest | No HA for slurmctld |

---

## Testing Failover

Regularly test your HA setup:

```bash
# 1. Verify both controllers are configured
scontrol show config | grep SlurmctldHost

# 2. Check which controller is active
scontrol ping

# 3. Simulate a failure (on the primary)
systemctl stop slurmctld

# 4. Verify failover
scontrol ping       # Should show the backup as active
sinfo               # Should work normally
sbatch test.sh      # Should succeed

# 5. Restore the primary
systemctl start slurmctld

# 6. Verify the primary resumed control
scontrol ping
```

---

## Compute Node Resilience

### ReturnToService

```ini
# slurm.conf
ReturnToService=2
```

| Value | Behavior |
|-------|----------|
| 0 | Node stays DOWN until an admin intervenes |
| 1 | Node returns if it re-registers with the same or more resources |
| 2 | Node returns if it re-registers with any resources (recommended) |

`ReturnToService=2` allows nodes that reboot (e.g., after a kernel panic) to rejoin the cluster automatically.

### HealthCheck

Run periodic health checks on compute nodes:

```ini
# slurm.conf
HealthCheckProgram=/etc/slurm/health_check.sh
HealthCheckInterval=600        # Every 10 minutes
HealthCheckNodeState=ANY
```

Example health check script:

```bash
#!/bin/bash
# /etc/slurm/health_check.sh
# Exit 0 = healthy, non-zero = a check failed.
# Note: Slurm does not act on the exit code; the script (or a wrapper
# such as NHC) must drain the node itself, e.g. via scontrol.

# Check GPU health
if command -v nvidia-smi &>/dev/null; then
    nvidia-smi > /dev/null 2>&1 || exit 1
fi

# Check filesystem
mountpoint -q /data || exit 1

# Check available memory
free_mb=$(free -m | awk '/Mem:/ {print $4}')
[ "$free_mb" -lt 1024 ] && exit 1

exit 0
```

> **ParallelCluster Note:** HA for slurmctld is not natively supported in ParallelCluster; the head node is a single point of failure. For production workloads, consider EC2 auto-recovery for the head node and regular EBS snapshots.
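Returning to the health check: because Slurm does not drain a node based on `HealthCheckProgram`'s exit status, a thin wrapper that calls `scontrol` on failure is a common pattern. A hypothetical sketch (the wrapper filename and the injectable drain command are assumptions for illustration):

```bash
#!/bin/bash
# /etc/slurm/health_check_wrapper.sh -- hypothetical wrapper around the
# health check script. Slurm ignores HealthCheckProgram's exit status, so
# the program itself must drain the node when a check fails.

run_health_check() {
    # $1 = path to the check script
    # $2 = drain command (injectable for testing; defaults to scontrol)
    local check="$1" drain="${2:-drain_with_scontrol}"
    if ! "$check"; then
        $drain "health check failed on $(hostname -s)"
    fi
}

drain_with_scontrol() {
    # Requires SlurmUser/root privileges on the node
    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="$1"
}

# Point HealthCheckProgram at this wrapper instead of the raw check:
# run_health_check /etc/slurm/health_check.sh
```

Tools like LBNL's Node Health Check (NHC) package this pattern, including the `scontrol` drain call, so sites often deploy NHC rather than a hand-rolled wrapper.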
---

## References

- SchedMD: Quick Start Administrator Guide -- Failover
- SchedMD: slurm.conf -- SlurmctldHost
- SchedMD: Accounting and Resource Limits