# High Availability

## Overview

Production Slurm clusters need resilience against failures. Slurm provides HA mechanisms for its critical components:

- **slurmctld** -- primary/backup failover
- **slurmdbd** -- database HA via MySQL replication or clustering
- **State persistence** -- checkpoint/recovery for crash survival

---

## slurmctld High Availability

### Backup Controller

Configure a second host as backup slurmctld:

```ini
# slurm.conf
SlurmctldHost=mgmt01(10.0.0.1)        # Primary
SlurmctldHost=mgmt02(10.0.0.2)        # Backup

StateSaveLocation=/shared/slurmctld   # Must be accessible to both hosts
SlurmctldTimeout=120                  # Seconds before backup takes over
```

Both hosts run slurmctld. The backup operates in standby mode, monitoring the primary via heartbeat. If the primary fails to respond within `SlurmctldTimeout`, the backup assumes the role.

### Requirements

- **Shared state directory:** `StateSaveLocation` must be on a filesystem accessible to both slurmctld hosts (NFS, shared block storage, DRBD)
- **Network:** Both hosts must be reachable by slurmd on all compute nodes
- **Configuration:** Identical `slurm.conf` on both hosts

### Failover Behavior

1. The primary slurmctld stops responding
2. After `SlurmctldTimeout` seconds, the backup detects the failure
3. The backup reads the latest state checkpoint from `StateSaveLocation`
4. The backup becomes the active controller
5. Compute nodes (slurmd) automatically redirect communication to the backup
6. Running jobs continue unaffected
7. Pending jobs resume scheduling
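The failover sequence above can be watched externally, e.g. from cron or a monitoring agent. A minimal sketch follows; the `scontrol ping` output pattern it greps for is an assumption (the format varies across Slurm versions), so adapt it to your cluster:

```bash
#!/bin/bash
# failover_alert.sh -- warn when the backup slurmctld is the active
# controller, i.e., a failover has occurred.
# ASSUMPTION: `scontrol ping` prints lines like
#   "Slurmctld(backup) at mgmt02 is UP"
# Adjust the pattern if your Slurm version formats this differently.

backup_is_active() {
    # $1 = captured `scontrol ping` output
    echo "$1" | grep -qE 'Slurmctld\(backup\).*is UP'
}

ping_out=$(scontrol ping 2>/dev/null || true)
if backup_is_active "$ping_out"; then
    echo "WARNING: backup slurmctld is active -- check the primary"
    exit 1
fi
```

Hooked into cron, a non-zero exit (or the WARNING line) can feed whatever alerting your site already uses.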
### Manual Failover

```bash
# On the primary, gracefully step down
systemctl stop slurmctld

# The backup takes over automatically. Verify:
scontrol ping
# Slurmctld(primary) at mgmt01 is DOWN
# Slurmctld(backup) at mgmt02 is UP
```

Alternatively, run `scontrol takeover` to instruct the backup controller to assume control.

### Failback

When the primary is repaired:

```bash
# Start slurmctld on the primary
systemctl start slurmctld

# The primary resumes control automatically: the backup saves state and
# returns to standby mode, and the primary reads the latest state.
```

No manual failback step is needed; the primary reasserts control when it restarts.

---

## slurmdbd High Availability

slurmdbd depends on MySQL/MariaDB. HA options:

### Option 1: MySQL Replication

- Primary MySQL for writes
- Read replica for failover
- `AccountingStorageBackupHost` in slurm.conf names a backup slurmdbd instance (which can run on the replica host)

```ini
# slurm.conf
AccountingStorageHost=db01
AccountingStorageBackupHost=db02
```

### Option 2: MySQL/MariaDB Galera Cluster

Multi-master synchronous replication:

- All nodes can accept writes
- Point slurmdbd at a load balancer or virtual IP
- True HA with no manual failover

### Option 3: slurmdbd Buffering

slurmdbd is not strictly required for Slurm to function. If slurmdbd is down:

- slurmctld **buffers accounting data** in memory
- Jobs continue to run and be scheduled
- When slurmdbd comes back, the buffered data is flushed
- Fair-share data may be slightly stale during the outage

This built-in resilience means a brief slurmdbd outage doesn't impact operations.

---

## State Directory Considerations

The `StateSaveLocation` directory is critical:

- **Fast storage:** SSD or a high-performance NFS mount
- **Reliable:** RAID, replicated, or on a highly available filesystem
- **Not too large:** typically a few hundred MB to a few GB
- **Backed up regularly**

```bash
# Monitor state directory size
du -sh /var/spool/slurmctld/

# Regular backup
rsync -a /var/spool/slurmctld/ /backup/slurmctld_state/
```
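The size check and the rsync backup can be combined into one cron-able script. A sketch, where the paths and the `MAX_MB` threshold are illustrative, not Slurm defaults:

```bash
#!/bin/bash
# backup_state.sh -- snapshot the slurmctld state directory and warn if it
# grows past a threshold. Paths and MAX_MB are illustrative; adjust per site.

MAX_MB=4096   # state directories are typically well under a few GB

state_size_mb() {
    # Print a directory's size in MB (empty output if it does not exist)
    du -sm "$1" 2>/dev/null | awk '{print $1}'
}

backup_state() {
    # $1 = state directory, $2 = backup target
    local size
    size=$(state_size_mb "$1")
    if [ "${size:-0}" -gt "$MAX_MB" ]; then
        echo "WARNING: $1 is ${size}MB (threshold ${MAX_MB}MB)"
    fi
    mkdir -p "$2" && rsync -a "$1/" "$2/"
}

# Example: run from cron on the controller host
# backup_state /var/spool/slurmctld /backup/slurmctld_state
```

An unexpectedly large state directory often means stale job scripts or checkpoint files accumulating, which is worth investigating before it slows failover.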
### Local vs. Shared State

| Approach | Pros | Cons |
|----------|------|------|
| Shared NFS | Simple HA failover | NFS is a dependency |
| Local + DRBD | No NFS dependency | More complex setup |
| Local only | Simplest | No HA for slurmctld |

---

## Testing Failover

Regularly test your HA setup:

```bash
# 1. Verify both controllers are configured
scontrol show config | grep SlurmctldHost

# 2. Check which controller is active
scontrol ping

# 3. Simulate a failure (on the primary)
systemctl stop slurmctld

# 4. Verify failover
scontrol ping       # Should show the backup as active
sinfo               # Should work normally
sbatch test.sh      # Should succeed

# 5. Restore the primary
systemctl start slurmctld

# 6. Verify the primary resumed control
scontrol ping
```

---

## Compute Node Resilience

### ReturnToService

```ini
# slurm.conf
ReturnToService=2
```

| Value | Behavior |
|-------|----------|
| 0 | Node stays DOWN until an admin intervenes |
| 1 | Node returns if it re-registers with the same or more resources |
| 2 | Node returns if it re-registers with any resources (recommended) |

`ReturnToService=2` allows nodes that reboot (e.g., after a kernel panic) to rejoin the cluster automatically.

### HealthCheck

Run periodic health checks on compute nodes:

```ini
# slurm.conf
HealthCheckProgram=/etc/slurm/health_check.sh
HealthCheckInterval=600        # Every 10 minutes
HealthCheckNodeState=ANY
```

Example health check script:

```bash
#!/bin/bash
# /etc/slurm/health_check.sh
# Exit 0 = healthy, non-zero = a check failed.
# Note: Slurm does not act on the exit code; the script (or a wrapper
# such as NHC) must drain the node itself, e.g. via scontrol.

# Check GPU health
if command -v nvidia-smi &>/dev/null; then
    nvidia-smi > /dev/null 2>&1 || exit 1
fi

# Check filesystem
mountpoint -q /data || exit 1

# Check available memory
free_mb=$(free -m | awk '/Mem:/ {print $4}')
[ "$free_mb" -lt 1024 ] && exit 1

exit 0
```

> **ParallelCluster Note:** HA for slurmctld is not natively supported in ParallelCluster; the head node is a single point of failure. For production workloads, consider EC2 auto-recovery for the head node and regular EBS snapshots.
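Returning to the health check: because Slurm does not drain a node based on `HealthCheckProgram`'s exit status, a thin wrapper that calls `scontrol` on failure is a common pattern. A hypothetical sketch (the wrapper filename and the injectable drain command are assumptions for illustration):

```bash
#!/bin/bash
# /etc/slurm/health_check_wrapper.sh -- hypothetical wrapper around the
# health check script. Slurm ignores HealthCheckProgram's exit status, so
# the program itself must drain the node when a check fails.

run_health_check() {
    # $1 = path to the check script
    # $2 = drain command (injectable for testing; defaults to scontrol)
    local check="$1" drain="${2:-drain_with_scontrol}"
    if ! "$check"; then
        $drain "health check failed on $(hostname -s)"
    fi
}

drain_with_scontrol() {
    # Requires SlurmUser/root privileges on the node
    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="$1"
}

# Point HealthCheckProgram at this wrapper instead of the raw check:
# run_health_check /etc/slurm/health_check.sh
```

Tools like LBNL's Node Health Check (NHC) package this pattern, including the `scontrol` drain call, so sites often deploy NHC rather than a hand-rolled wrapper.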
---

## References

- SchedMD: Quick Start Administrator Guide -- Failover
- SchedMD: slurm.conf -- SlurmctldHost
- SchedMD: Accounting and Resource Limits