# Best Practices

## Resource Estimation

### Right-Size Your Jobs

Over-requesting resources wastes cluster capacity and can increase your wait time (larger requests are harder to schedule). Under-requesting causes job failures.

**The workflow:**

1. Run a small test with generous resources
2. Check actual usage with `sacct` or `seff`
3. Adjust your requests with ~20-50% headroom
4. Repeat as you scale up

```bash
# Check actual usage of a completed job
$ sacct -j 12345 --format=JobID,Elapsed,MaxRSS,TotalCPU,AllocCPUS,State

# seff gives a nice summary (if installed)
$ seff 12345
Job ID: 12345
State: COMPLETED (exit code 0)
Cores: 16
CPU Utilized: 10:23:45
CPU Efficiency: 64.8% of 16:00:00 core-walltime
Memory Utilized: 24.5 GB
Memory Efficiency: 76.6% of 32.00 GB
```

### Time Limits

- **Too short:** Job gets killed (TIMEOUT), wasting all compute time spent
- **Too long:** Reduces your priority for backfill scheduling
- **Sweet spot:** Actual expected time + 30-50% buffer

```bash
# Check runtimes of your recent similar jobs
$ sacct --name=blast_run --format=JobID,Elapsed,State --starttime=now-30days
```

### Memory

- Check `MaxRSS` from `sacct` -- this is peak actual memory usage
- Request 20-30% more than `MaxRSS` for safety
- If you don't know, start with the node's per-CPU default and adjust

---

## Job Submission Practices

### Use Job Arrays, Not Loops

```bash
# BAD
for i in $(seq 1 500); do
  sbatch process.sh $i
done

# GOOD
sbatch --array=1-500%50 process.sh
```

### Use Dependencies for Pipelines

```bash
# BAD: manual waiting and checking
sbatch step1.sh
# ... manually check, then ...
sbatch step2.sh

# GOOD: automated pipeline
JOB1=$(sbatch --parsable step1.sh)
sbatch --dependency=afterok:$JOB1 step2.sh
```

### Name Your Jobs

```bash
#SBATCH --job-name=blast_sampleA
```

Good job names make `squeue` output readable and `sacct` queries useful.
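The 20-30% headroom rule from the Memory section above can be sketched as a quick shell calculation. This is a minimal sketch: the `MaxRSS` value and the 25% margin are illustrative, and in practice you would read `MaxRSS` from `sacct` output for your own job.

```bash
#!/usr/bin/env bash
# Sketch: derive a --mem request with ~25% headroom over observed peak usage.
# Assumes MaxRSS was read from `sacct -j <jobid> --format=MaxRSS` (e.g. "24500M").

maxrss_mb=24500   # illustrative peak usage in MB
headroom_pct=25   # 20-30% safety margin

request_mb=$(( maxrss_mb + maxrss_mb * headroom_pct / 100 ))
echo "Request: --mem=${request_mb}M"
```

For the values above this prints `Request: --mem=30625M`, which you would round to a convenient figure (e.g. `--mem=31G`).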
### Organize Output Files

```bash
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
```

Create the `logs/` directory before submitting. Use `%x` (job name) and `%j` (job ID) for unique filenames.

### Set Mail Notifications Wisely

```bash
#SBATCH --mail-type=END,FAIL  # Notified when job ends or fails
#SBATCH --mail-user=you@example.com
```

Don't use `--mail-type=ALL` for array jobs -- you'll get thousands of emails.

---

## Job Script Best Practices

### Make Scripts Self-Contained

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Clean module environment
module purge
module load blast/2.15

# Move to working directory (explicit)
cd $SLURM_SUBMIT_DIR

# Use Slurm variables for thread count
blastn -query input.fasta -db /data/nt \
    -num_threads $SLURM_CPUS_PER_TASK \
    -out results.txt
```

### Use set -e for Fail-Fast

```bash
#!/bin/bash
#SBATCH ...
set -euo pipefail  # Exit on error, undefined vars, pipe failures

module purge
module load samtools/1.17

samtools sort input.bam -o sorted.bam
samtools index sorted.bam
```

Without `set -e`, the script continues after failures, producing incomplete or wrong results.

### Don't Hardcode Thread Counts

```bash
# BAD
blastn -num_threads 16 ...

# GOOD
blastn -num_threads $SLURM_CPUS_PER_TASK ...
```

This way, changing `--cpus-per-task` automatically adjusts the program's thread count.

---

## Filesystem Best Practices

### Know Your Filesystems

| Filesystem | Best For | Watch Out |
|------------|----------|-----------|
| `/home` | Scripts, configs, small files | Often quota-limited (10-50 GB) |
| `/scratch` or `/work` | Active job data, large files | May be purged periodically |
| `/data` or `/shared` | Reference databases, shared data | Usually read-heavy |
| Local SSD (`/tmp`, `/local`) | Temporary I/O-intensive work | Lost after job ends |

### Direct I/O to Appropriate Storage

```bash
#!/bin/bash
#SBATCH ...
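# Note (site-dependent assumption): many clusters export a per-job local
# directory as $SLURM_TMPDIR, which is private to the job and cleaned up
# automatically; where it exists, prefer it to a bare /tmp path.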
# Copy input to local scratch for fast I/O
cp /data/large_input.bam /tmp/input.bam

# Run analysis on local scratch
samtools sort /tmp/input.bam -o /tmp/sorted.bam

# Copy results back
cp /tmp/sorted.bam $SLURM_SUBMIT_DIR/sorted.bam

# Clean up
rm /tmp/input.bam /tmp/sorted.bam
```

### Don't Write Many Small Files to Shared Filesystems

Shared filesystems (NFS, Lustre) struggle with millions of small files. If your workflow creates many temp files, use local storage.

> **ParallelCluster / PCS Note (Cloud Cost):** On cloud clusters, **accurate `--time` limits directly affect cost**. Overly generous wall times keep dynamic nodes running (and billing) longer than necessary. Conversely, be aware that scale-to-zero clusters have a **startup delay** (1-3 minutes) when new nodes must launch -- factor this into your expectations but not your `--time` request (Slurm starts the clock after the node is ready).

---

## Being a Good Cluster Citizen

### Throttle Array Jobs

```bash
#SBATCH --array=1-10000%50  # Max 50 concurrent
```

### Don't Request Exclusive Unless Needed

`--exclusive` prevents other jobs from sharing your node, even if you're only using 4 of 64 CPUs.

### Cancel Jobs You Don't Need

```bash
# Don't leave broken or stalled jobs running
$ scancel 12345
```

### Test Before Scaling

1. Run one job interactively to verify it works
2. Submit a small batch (5-10 jobs)
3. Check results and resource usage
4. Scale up with confidence

---

## Debugging Checklist

When a job fails:

1. **Check output/error files:** `cat slurm-<jobid>.out`
2. **Check exit code:** `sacct -j <jobid> --format=JobID,State,ExitCode`
3. **Check resource usage:** Did it run out of memory (OOM) or time (TIMEOUT)?
4. **Test interactively:** `srun --pty bash`, then run the same commands
5. **Check the environment:** Are all modules loaded? Are paths correct?
6. **Check permissions:** Can the compute node access your files?
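The small-files advice in the Filesystem section above can be wrapped in a reusable pattern: create a per-job scratch directory on local disk and clean it up on exit. This is a sketch using only standard bash and coreutils; `SLURM_TMPDIR` is site-dependent and may be unset, in which case the script falls back to `/tmp`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Private scratch directory on local disk (site-dependent SLURM_TMPDIR, else /tmp)
scratch=$(mktemp -d "${SLURM_TMPDIR:-/tmp}/job_scratch.XXXXXX")
trap 'rm -rf "$scratch"' EXIT   # always clean up, even on failure

# Write the many small intermediate files locally, not on NFS/Lustre
for i in $(seq 1 100); do
  echo "chunk $i" > "$scratch/part_$i.txt"
done

# Aggregate into one file before anything touches shared storage
cat "$scratch"/part_*.txt > combined.txt
echo "combined $(wc -l < combined.txt) lines"
```

The `trap ... EXIT` line is what makes this a good-citizen pattern: the local files disappear even if the job fails partway through.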
---

## Quick Reference: Common Mistakes

| Mistake | Consequence | Fix |
|---------|-------------|-----|
| No `--time` | Uses partition default (may be short) | Always set `--time` |
| `--ntasks=16` for threaded app | 16 copies of program, 1 CPU each | `--ntasks=1 --cpus-per-task=16` |
| No `module purge` in script | Inconsistent environment | Add `module purge` at top |
| Hardcoded thread count | Doesn't match allocation | Use `$SLURM_CPUS_PER_TASK` |
| Submission loop instead of array | Overloads scheduler | Use `--array` |
| Output dir doesn't exist | Job fails immediately | `mkdir -p logs/` before submit |
| No `--mem` specified | Gets partition default (may be low) | Always set `--mem` |