# Best Practices

## Resource Estimation

### Right-Size Your Jobs

Over-requesting resources wastes cluster capacity and can increase your wait time (larger requests are harder to schedule). Under-requesting causes job failures.

**The workflow:**

1. Run a small test with generous resources
2. Check actual usage with `sacct` or `seff`
3. Adjust your requests with ~20-50% headroom
4. Repeat as you scale up

```bash
# Check actual usage of a completed job
$ sacct -j 12345 --format=JobID,Elapsed,MaxRSS,TotalCPU,AllocCPUS,State

# seff gives a nice summary (if installed)
$ seff 12345
Job ID: 12345
State: COMPLETED (exit code 0)
Cores: 16
CPU Utilized: 10:23:45
CPU Efficiency: 64.8% of 16:00:00 core-walltime
Memory Utilized: 24.5 GB
Memory Efficiency: 76.6% of 32.00 GB
```

### Time Limits

- **Too short:** Job gets killed (TIMEOUT), wasting all compute time spent
- **Too long:** Reduces your priority for backfill scheduling
- **Sweet spot:** Actual expected time + 30-50% buffer

```bash
# Check runtimes of your recent similar jobs
$ sacct --name=blast_run --format=JobID,Elapsed,State --starttime=now-30days
```

### Memory

- Check `MaxRSS` from `sacct` -- this is peak actual memory usage
- Request 20-30% more than `MaxRSS` for safety
- If you don't know, start with the node's per-CPU default and adjust

---

## Job Submission Practices

### Use Job Arrays, Not Loops

```bash
# BAD
for i in $(seq 1 500); do
  sbatch process.sh $i
done

# GOOD
sbatch --array=1-500%50 process.sh
```

### Use Dependencies for Pipelines

```bash
# BAD: manual waiting and checking
sbatch step1.sh
# ... manually check, then ...
sbatch step2.sh

# GOOD: automated pipeline
JOB1=$(sbatch --parsable step1.sh)
sbatch --dependency=afterok:$JOB1 step2.sh
```

### Name Your Jobs

```bash
#SBATCH --job-name=blast_sampleA
```

Good job names make `squeue` output readable and `sacct` queries useful.
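The 20-30% headroom rule from the Memory section above can be sketched as a quick shell calculation. This is a minimal sketch: the `MaxRSS` value and the 25% margin are illustrative, and in practice you would read `MaxRSS` from `sacct` output for your own job.

```bash
#!/usr/bin/env bash
# Sketch: derive a --mem request with ~25% headroom over observed peak usage.
# Assumes MaxRSS was read from `sacct -j <jobid> --format=MaxRSS` (e.g. "24500M").

maxrss_mb=24500   # illustrative peak usage in MB
headroom_pct=25   # 20-30% safety margin

request_mb=$(( maxrss_mb + maxrss_mb * headroom_pct / 100 ))
echo "Request: --mem=${request_mb}M"
```

For the values above this prints `Request: --mem=30625M`, which you would round to a convenient figure (e.g. `--mem=31G`).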
### Organize Output Files

```bash
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
```

Create the `logs/` directory before submitting. Use `%x` (job name) and `%j` (job ID) for unique filenames.

### Set Mail Notifications Wisely

```bash
#SBATCH --mail-type=END,FAIL  # Notified when job ends or fails
#SBATCH --mail-user=you@example.com
```

Don't use `--mail-type=ALL` for array jobs -- you'll get thousands of emails.

---

## Job Script Best Practices

### Make Scripts Self-Contained

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Clean module environment
module purge
module load blast/2.15

# Move to working directory (explicit)
cd $SLURM_SUBMIT_DIR

# Use Slurm variables for thread count
blastn -query input.fasta -db /data/nt \
    -num_threads $SLURM_CPUS_PER_TASK \
    -out results.txt
```

### Use set -e for Fail-Fast

```bash
#!/bin/bash
#SBATCH ...
set -euo pipefail  # Exit on error, undefined vars, pipe failures

module purge
module load samtools/1.17

samtools sort input.bam -o sorted.bam
samtools index sorted.bam
```

Without `set -e`, the script continues after failures, producing incomplete or wrong results.

### Don't Hardcode Thread Counts

```bash
# BAD
blastn -num_threads 16 ...

# GOOD
blastn -num_threads $SLURM_CPUS_PER_TASK ...
```

This way, changing `--cpus-per-task` automatically adjusts the program's thread count.

---

## Filesystem Best Practices

### Know Your Filesystems

| Filesystem | Best For | Watch Out |
|------------|----------|-----------|
| `/home` | Scripts, configs, small files | Often quota-limited (10-50 GB) |
| `/scratch` or `/work` | Active job data, large files | May be purged periodically |
| `/data` or `/shared` | Reference databases, shared data | Usually read-heavy |
| Local SSD (`/tmp`, `/local`) | Temporary I/O-intensive work | Lost after job ends |

### Direct I/O to Appropriate Storage

```bash
#!/bin/bash
#SBATCH ...
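# Note (site-dependent assumption): many clusters export a per-job local
# directory as $SLURM_TMPDIR, which is private to the job and cleaned up
# automatically; where it exists, prefer it to a bare /tmp path.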
# Copy input to local scratch for fast I/O
cp /data/large_input.bam /tmp/input.bam

# Run analysis on local scratch
samtools sort /tmp/input.bam -o /tmp/sorted.bam

# Copy results back
cp /tmp/sorted.bam $SLURM_SUBMIT_DIR/sorted.bam

# Clean up
rm /tmp/input.bam /tmp/sorted.bam
```

### Don't Write Many Small Files to Shared Filesystems

Shared filesystems (NFS, Lustre) struggle with millions of small files. If your workflow creates many temp files, use local storage.

> **ParallelCluster / PCS Note (Cloud Cost):** On cloud clusters, **accurate `--time` limits directly affect cost**. Overly generous wall times keep dynamic nodes running (and billing) longer than necessary. Conversely, be aware that scale-to-zero clusters have a **startup delay** (1-3 minutes) when new nodes must launch -- factor this into your expectations but not your `--time` request (Slurm starts the clock after the node is ready).

---

## Being a Good Cluster Citizen

### Throttle Array Jobs

```bash
#SBATCH --array=1-10000%50  # Max 50 concurrent
```

### Don't Request Exclusive Unless Needed

`--exclusive` prevents other jobs from sharing your node, even if you're only using 4 of 64 CPUs.

### Cancel Jobs You Don't Need

```bash
# Don't leave broken or stalled jobs running
$ scancel 12345
```

### Test Before Scaling

1. Run one job interactively to verify it works
2. Submit a small batch (5-10 jobs)
3. Check results and resource usage
4. Scale up with confidence

---

## Debugging Checklist

When a job fails:

1. **Check output/error files:** `cat slurm-<jobid>.out`
2. **Check exit code:** `sacct -j <jobid> --format=JobID,State,ExitCode`
3. **Check resource usage:** Did it run out of memory (OOM) or time (TIMEOUT)?
4. **Test interactively:** `srun --pty bash`, then run the same commands
5. **Check the environment:** Are all modules loaded? Are paths correct?
6. **Check permissions:** Can the compute node access your files?
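The small-files advice in the Filesystem section above can be wrapped in a reusable pattern: create a per-job scratch directory on local disk and clean it up on exit. This is a sketch using only standard bash and coreutils; `SLURM_TMPDIR` is site-dependent and may be unset, in which case the script falls back to `/tmp`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Private scratch directory on local disk (site-dependent SLURM_TMPDIR, else /tmp)
scratch=$(mktemp -d "${SLURM_TMPDIR:-/tmp}/job_scratch.XXXXXX")
trap 'rm -rf "$scratch"' EXIT   # always clean up, even on failure

# Write the many small intermediate files locally, not on NFS/Lustre
for i in $(seq 1 100); do
  echo "chunk $i" > "$scratch/part_$i.txt"
done

# Aggregate into one file before anything touches shared storage
cat "$scratch"/part_*.txt > combined.txt
echo "combined $(wc -l < combined.txt) lines"
```

The `trap ... EXIT` line is what makes this a good-citizen pattern: the local files disappear even if the job fails partway through.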
---

## Quick Reference: Common Mistakes

| Mistake | Consequence | Fix |
|---------|-------------|-----|
| No `--time` | Uses partition default (may be short) | Always set `--time` |
| `--ntasks=16` for threaded app | 16 copies of program, 1 CPU each | `--ntasks=1 --cpus-per-task=16` |
| No `module purge` in script | Inconsistent environment | Add `module purge` at top |
| Hardcoded thread count | Doesn't match allocation | Use `$SLURM_CPUS_PER_TASK` |
| Submission loop instead of array | Overloads scheduler | Use `--array` |
| Output dir doesn't exist | Job fails immediately | `mkdir -p logs/` before submit |
| No `--mem` specified | Gets partition default (may be low) | Always set `--mem` |