# Maintenance & Operations

## Draining Nodes for Maintenance

### Drain Individual Nodes

```bash
# Drain a node (existing jobs finish, no new jobs start)
scontrol update NodeName=node003 State=DRAIN Reason="memory DIMM replacement ETA 2pm"

# Check status
scontrol show node node003 | grep State
#   State=DRAINING  -> jobs still running
#   State=DRAINED   -> all jobs finished

# After maintenance, return to service
scontrol update NodeName=node003 State=RESUME
```

### Drain Multiple Nodes

```bash
# Drain a range
scontrol update NodeName=node[010-020] State=DRAIN Reason="firmware update"

# Drain an entire partition
for node in $(sinfo -p gpu -h -N -o "%N"); do
    scontrol update NodeName=$node State=DRAIN Reason="GPU driver update"
done
```

### Reservations for Planned Maintenance

A reservation prevents new jobs from being scheduled on the resources:

```bash
# Create a maintenance reservation starting in 4 hours
scontrol create reservation ReservationName=maint_window \
    StartTime=now+4hours Duration=180 \
    Nodes=node[001-048] User=root Flags=MAINT

# Jobs already running will continue; no new ones will start.
# After maintenance:
scontrol delete ReservationName=maint_window
scontrol update NodeName=node[001-048] State=RESUME
```

---

## Slurm Upgrades

### Rolling Upgrade (Preferred)

Slurm supports upgrading one major version at a time (e.g., 24.05 → 24.11 → 25.05). The typical order:

1. **Upgrade slurmdbd first** (backward compatible with older slurmctld)
2. **Upgrade slurmctld** (can manage older slurmd nodes temporarily)
3. **Upgrade slurmd** on compute nodes (can be done in rolling fashion)
4. **Upgrade client tools** on login nodes

```bash
# 1. slurmdbd host
systemctl stop slurmdbd
yum update slurm-slurmdbd
systemctl start slurmdbd

# 2. slurmctld host
systemctl stop slurmctld
yum update slurm-slurmctld
systemctl start slurmctld

# 3. Compute nodes (rolling, one group at a time)
scontrol update NodeName=node[001-010] State=DRAIN Reason="Slurm upgrade"
# Wait for DRAINED state
ansible node[001-010] -m yum -a "name=slurm-slurmd state=latest"
ansible node[001-010] -m systemd -a "name=slurmd state=restarted"
scontrol update NodeName=node[001-010] State=RESUME
# Repeat for next group
```

### Pre-Upgrade Checklist

- [ ] Back up the SlurmDBD database: `mysqldump slurm_acct_db > backup.sql`
- [ ] Back up state: `cp -r /var/spool/slurmctld /var/spool/slurmctld.bak`
- [ ] Back up config files: `tar czf slurm_config_backup.tar.gz /etc/slurm/`
- [ ] Read the [release notes](https://slurm.schedmd.com/news.html) for deprecations
- [ ] Test on a non-production node first

---

## Backup and Recovery

### SlurmDBD Database Backup

```bash
# Daily database backup (add to cron or scrontab)
mysqldump --single-transaction slurm_acct_db | gzip > /backup/slurm_db_$(date +%Y%m%d).sql.gz

# Retain 30 days
find /backup/ -name "slurm_db_*.sql.gz" -mtime +30 -delete
```

### State Directory Backup

```bash
# Back up slurmctld state (copying while slurmctld is running is OK)
rsync -a /var/spool/slurmctld/ /backup/slurmctld_state/
```

### Recovery After slurmctld Crash

slurmctld recovers automatically from its state checkpoint:

```bash
# Normal recovery (default -r flag): restores jobs and DOWN/DRAIN node states
systemctl start slurmctld

# Full recovery (-R): also restores partition states and power settings
slurmctld -R
```

### Recovery After Database Loss

If slurmdbd's database is lost, slurmctld will buffer accounting records.
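The size of that in-memory backlog can be watched while the database is down: `sdiag` reports the DBD agent queue size in its output. A minimal sketch, wrapped in a helper function for illustration (the function name is not part of Slurm):

```bash
# dbd_backlog: report slurmctld's buffered accounting backlog.
# sdiag prints a "DBD Agent queue size" line; a steadily growing
# value means records are queuing because slurmdbd (or its
# database) is unreachable.
dbd_backlog() {
    sdiag | grep -i "dbd agent queue size"
}
```

Run it periodically during the outage; once slurmdbd reconnects to a working database, the queue size should fall back toward zero.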
Restore the database and slurmdbd will catch up:

```bash
mysql slurm_acct_db < backup.sql
systemctl restart slurmdbd
```

---

## Configuration Versioning

Track all config changes in git:

```bash
# Initial setup
cd /etc/slurm
git init
git add slurm.conf slurmdbd.conf cgroup.conf gres.conf
git commit -m "Initial Slurm configuration"

# After changes
git diff
git add -A
git commit -m "Added gpu partition, increased batch MaxTime"
```

---

## Routine Operational Tasks

### Check for Stuck Jobs

```bash
# Jobs running longer than expected
# (rough heuristic: Elapsed and TimeLimit are compared as strings)
sacct --starttime=now-7days --state=RUNNING --format=JobID,User,Elapsed,TimeLimit,NodeList \
    | awk 'NR>2 && $3 > $4 {print}'
```

### Clear Error States

```bash
# Find nodes in error state
sinfo -t error -N -o "%N %E"

# After investigating and fixing:
scontrol update NodeName=node005 State=RESUME
```

### Monitor Filesystem Usage

```bash
# Check that key filesystems aren't full
df -h /var/spool/slurmctld   # State directory
df -h /var/log/slurm         # Log directory
df -h /tmp                   # Compute node temp
```

### Audit User Access

```bash
# List all users and their accounts/QOS
sacctmgr show association format=User,Account,QOS,Fairshare,MaxJobs

# Find users with no recent activity
sacctmgr show user --noheader --format=User,DefaultAccount | while read user acct; do
    count=$(sacct --user=$user --starttime=now-90days --noheader | wc -l)
    [ "$count" -eq 0 ] && echo "Inactive: $user ($acct)"
done
```

> **ParallelCluster Note:** ParallelCluster manages node lifecycle (launch, terminate) automatically. Don't drain or down dynamic nodes manually -- let the power-saving daemon handle idle node termination. For Slurm upgrades, create a new cluster with the target version and migrate workloads -- in-place upgrades are not supported.

> **PCS Note:** Slurm upgrades are managed automatically by AWS. PCS handles control-plane patching transparently. Compute node AMIs can be updated by modifying the compute node group configuration and rolling the fleet.
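Several of the tasks above follow the same pattern: drain a node set, wait until nothing is still DRAINING, then act. That wait can be scripted rather than watched by hand. A minimal sketch, with an illustrative helper name and polling interval:

```bash
# drain_and_wait NODELIST REASON
# Drains the given nodes, then polls sinfo until no node in the
# set is still in the DRAINING state (i.e., all running jobs have
# finished and every node has reached DRAINED).
drain_and_wait() {
    nodes="$1"; reason="$2"
    scontrol update NodeName="$nodes" State=DRAIN Reason="$reason"
    while [ -n "$(sinfo -n "$nodes" -h -t draining -N -o '%N')" ]; do
        sleep 60   # illustrative polling interval
    done
}

# Example: drain a group before a rolling slurmd upgrade
# drain_and_wait "node[001-010]" "Slurm upgrade"
```

Chaining the upgrade or maintenance commands after the function call then guarantees they only run once the nodes are idle.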
---

## References

- SchedMD: Upgrade Guide
- SchedMD: Quick Start Administrator Guide
- SchedMD: scontrol man page