# Maintenance & Operations

## Draining Nodes for Maintenance

### Drain Individual Nodes

```bash
# Drain a node (existing jobs finish, no new jobs start)
scontrol update NodeName=node003 State=DRAIN Reason="memory DIMM replacement ETA 2pm"

# Check status
scontrol show node node003 | grep State
#   State=DRAINING  -> jobs still running
#   State=DRAINED   -> all jobs finished

# After maintenance, return to service
scontrol update NodeName=node003 State=RESUME
```

### Drain Multiple Nodes

```bash
# Drain a range
scontrol update NodeName=node[010-020] State=DRAIN Reason="firmware update"

# Drain an entire partition
for node in $(sinfo -p gpu -h -N -o "%N"); do
    scontrol update NodeName=$node State=DRAIN Reason="GPU driver update"
done
```

### Reservations for Planned Maintenance

A reservation prevents new jobs from being scheduled on the resources:

```bash
# Create a maintenance reservation starting in 4 hours
scontrol create reservation ReservationName=maint_window \
    StartTime=now+4hours Duration=180 \
    Nodes=node[001-048] User=root Flags=MAINT

# Jobs already running will continue; no new ones will start.
# After maintenance:
scontrol delete ReservationName=maint_window
scontrol update NodeName=node[001-048] State=RESUME
```

---

## Slurm Upgrades

### Rolling Upgrade (Preferred)

Slurm supports upgrading one major version at a time (e.g., 24.05 → 24.11 → 25.05). The typical order:

1. **Upgrade slurmdbd first** (backward compatible with older slurmctld)
2. **Upgrade slurmctld** (can manage older slurmd nodes temporarily)
3. **Upgrade slurmd** on compute nodes (can be done in rolling fashion)
4. **Upgrade client tools** on login nodes

```bash
# 1. slurmdbd host
systemctl stop slurmdbd
yum update slurm-slurmdbd
systemctl start slurmdbd

# 2. slurmctld host
systemctl stop slurmctld
yum update slurm-slurmctld
systemctl start slurmctld

# 3. Compute nodes (rolling, one group at a time)
scontrol update NodeName=node[001-010] State=DRAIN Reason="Slurm upgrade"
# Wait for DRAINED state
ansible node[001-010] -m yum -a "name=slurm-slurmd state=latest"
ansible node[001-010] -m systemd -a "name=slurmd state=restarted"
scontrol update NodeName=node[001-010] State=RESUME
# Repeat for next group
```

### Pre-Upgrade Checklist

- [ ] Back up the SlurmDBD database: `mysqldump slurm_acct_db > backup.sql`
- [ ] Back up state: `cp -r /var/spool/slurmctld /var/spool/slurmctld.bak`
- [ ] Back up config files: `tar czf slurm_config_backup.tar.gz /etc/slurm/`
- [ ] Read the [release notes](https://slurm.schedmd.com/news.html) for deprecations
- [ ] Test on a non-production node first

---

## Backup and Recovery

### SlurmDBD Database Backup

```bash
# Daily database backup (add to cron or scrontab)
mysqldump --single-transaction slurm_acct_db | gzip > /backup/slurm_db_$(date +%Y%m%d).sql.gz

# Retain 30 days
find /backup/ -name "slurm_db_*.sql.gz" -mtime +30 -delete
```

### State Directory Backup

```bash
# Back up slurmctld state (copying while slurmctld is running is OK)
rsync -a /var/spool/slurmctld/ /backup/slurmctld_state/
```

### Recovery After slurmctld Crash

slurmctld recovers automatically from its state checkpoint:

```bash
# Normal recovery (default -r flag): restores jobs and DOWN/DRAIN node states
systemctl start slurmctld

# Full recovery (-R): also restores partition states and power settings
slurmctld -R
```

### Recovery After Database Loss

If slurmdbd's database is lost, slurmctld will buffer accounting records.
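The size of that in-memory backlog can be watched while the database is down: `sdiag` reports the DBD agent queue size in its output. A minimal sketch, wrapped in a helper function for illustration (the function name is not part of Slurm):

```bash
# dbd_backlog: report slurmctld's buffered accounting backlog.
# sdiag prints a "DBD Agent queue size" line; a steadily growing
# value means records are queuing because slurmdbd (or its
# database) is unreachable.
dbd_backlog() {
    sdiag | grep -i "dbd agent queue size"
}
```

Run it periodically during the outage; once slurmdbd reconnects to a working database, the queue size should fall back toward zero.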
Restore the database and slurmdbd will catch up:

```bash
mysql slurm_acct_db < backup.sql
systemctl restart slurmdbd
```

---

## Configuration Versioning

Track all config changes in git:

```bash
# Initial setup
cd /etc/slurm
git init
git add slurm.conf slurmdbd.conf cgroup.conf gres.conf
git commit -m "Initial Slurm configuration"

# After changes
git diff
git add -A
git commit -m "Added gpu partition, increased batch MaxTime"
```

---

## Routine Operational Tasks

### Check for Stuck Jobs

```bash
# Jobs running longer than expected
# (rough heuristic: Elapsed and TimeLimit are compared as strings)
sacct --starttime=now-7days --state=RUNNING --format=JobID,User,Elapsed,TimeLimit,NodeList \
    | awk 'NR>2 && $3 > $4 {print}'
```

### Clear Error States

```bash
# Find nodes in error state
sinfo -t error -N -o "%N %E"

# After investigating and fixing:
scontrol update NodeName=node005 State=RESUME
```

### Monitor Filesystem Usage

```bash
# Check that key filesystems aren't full
df -h /var/spool/slurmctld   # State directory
df -h /var/log/slurm         # Log directory
df -h /tmp                   # Compute node temp
```

### Audit User Access

```bash
# List all users and their accounts/QOS
sacctmgr show association format=User,Account,QOS,Fairshare,MaxJobs

# Find users with no recent activity
sacctmgr show user --noheader --format=User,DefaultAccount | while read user acct; do
    count=$(sacct --user=$user --starttime=now-90days --noheader | wc -l)
    [ "$count" -eq 0 ] && echo "Inactive: $user ($acct)"
done
```

> **ParallelCluster Note:** ParallelCluster manages node lifecycle (launch, terminate) automatically. Don't drain or down dynamic nodes manually -- let the power-saving daemon handle idle node termination. For Slurm upgrades, create a new cluster with the target version and migrate workloads -- in-place upgrades are not supported.

> **PCS Note:** Slurm upgrades are managed automatically by AWS. PCS handles control-plane patching transparently. Compute node AMIs can be updated by modifying the compute node group configuration and rolling the fleet.
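Several of the tasks above follow the same pattern: drain a node set, wait until nothing is still DRAINING, then act. That wait can be scripted rather than watched by hand. A minimal sketch, with an illustrative helper name and polling interval:

```bash
# drain_and_wait NODELIST REASON
# Drains the given nodes, then polls sinfo until no node in the
# set is still in the DRAINING state (i.e., all running jobs have
# finished and every node has reached DRAINED).
drain_and_wait() {
    nodes="$1"; reason="$2"
    scontrol update NodeName="$nodes" State=DRAIN Reason="$reason"
    while [ -n "$(sinfo -n "$nodes" -h -t draining -N -o '%N')" ]; do
        sleep 60   # illustrative polling interval
    done
}

# Example: drain a group before a rolling slurmd upgrade
# drain_and_wait "node[001-010]" "Slurm upgrade"
```

Chaining the upgrade or maintenance commands after the function call then guarantees they only run once the nodes are idle.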
---

## References

- SchedMD: Upgrade Guide
- SchedMD: Quick Start Administrator Guide
- SchedMD: scontrol man page