Training Tracks Guide¶
This guide organizes the 44 training modules into recommended reading paths by audience. Each track is self-contained: start with the track that matches your role and add modules from other tracks as needed.
Audience Levels¶
| Level | Role | Description |
|---|---|---|
| L1 | Curious End-User | "What is this scheduler thing and why should I care?" |
| L2 | Working End-User | "I need to submit jobs and get work done" |
| L3 | Power User | "I want to optimize my workflows and use advanced features" |
| L4 | Administrator | "I need to install, configure, and manage Slurm" |
| L5 | IT Leadership | "I need to understand capacity, cost, and strategy" |
Track 1: Getting Started (L1-L2)¶
For users new to HPC or Slurm. Start here if you have never used a job scheduler before.
| # | Module | What You'll Learn |
|---|---|---|
| 1 | What is HPC Scheduling? | Why clusters need a scheduler, the "contract" between you and the system |
| 2 | Slurm Overview | Key concepts: nodes, partitions, jobs, steps |
| 3 | Getting Started | Log in, write your first job script, submit with sbatch |
| 4 | Submitting Jobs | sbatch options, directives, output files, --wrap |
| 5 | Monitoring Jobs | squeue, scontrol show job, sacct, sinfo |
| 6 | Managing Jobs | scancel, hold/release, modify pending jobs |
After this track: You can submit jobs, check their status, and manage them. Continue to Track 2 for resource management and advanced features.
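The first-job workflow from modules 3-5 boils down to a script like this (the job name, output pattern, and time limit are placeholder values):

```shell
#!/bin/bash
#SBATCH --job-name=hello        # name shown in squeue
#SBATCH --output=hello-%j.out   # %j expands to the job ID
#SBATCH --time=00:05:00
#SBATCH --time=00:05:00         # wall-time limit (HH:MM:SS)
#SBATCH --ntasks=1              # a single task

echo "Hello from $(hostname)"
```

Submit with `sbatch hello.sh`, watch it with `squeue --me` (on older Slurm versions, `squeue -u $USER`), and review it after completion with `sacct -j <jobid>`.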
Quick reference: Command Cheatsheet | Environment Variables
Track 2: End-User Essentials (L2)¶
For working users who submit jobs regularly. Builds on Track 1.
| # | Module | What You'll Learn |
|---|---|---|
| 1-6 | Track 1 modules | (prerequisite) |
| 7 | Resource Requests | --mem, --cpus-per-task, --time, --exclusive, GRES |
| 8 | Interactive Jobs | srun, salloc, X11 forwarding, Jupyter on compute nodes |
| 9 | Environment Modules | module load/unload, Lmod, using modules in job scripts |
| 10 | Containers on Slurm | Singularity/Apptainer, OCI containers, --container |
| 11 | Best Practices | Resource estimation, efficiency, being a good cluster citizen |
After this track: You can write efficient job scripts, request appropriate resources, and use the cluster effectively.
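The resource-request directives from module 7 slot into the same script skeleton. A sketch, where the CPU, memory, and time values are illustrative and `my_tool` is a hypothetical application:

```shell
#!/bin/bash
#SBATCH --cpus-per-task=8   # CPU cores for a multithreaded tool
#SBATCH --mem=16G           # memory for the whole job
#SBATCH --time=02:00:00     # request a bit more than your longest expected run

# Match the tool's thread count to the allocation instead of hard-coding it
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with $OMP_NUM_THREADS threads"
# my_tool --threads "$OMP_NUM_THREADS" input.dat   (hypothetical application)
```

After the job finishes, compare requested versus used resources with `sacct -j <jobid> -o JobID,Elapsed,MaxRSS,ReqMem` and tune future requests accordingly, which is the efficiency theme of module 11.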
Quick reference: Job State Codes
Track 3: Power User (L3)¶
For users running complex workflows, MPI jobs, GPU workloads, and automated pipelines. Builds on Track 2.
| # | Module | What You'll Learn |
|---|---|---|
| 1-11 | Track 2 modules | (prerequisite) |
| 12 | Job Arrays | --array, SLURM_ARRAY_TASK_ID, parameter sweeps, throttling |
| 13 | Job Dependencies | --dependency, afterok/afterany, building pipelines |
| 14 | Parallel & MPI Jobs | --ntasks vs --cpus-per-task, srun as MPI launcher, hybrid jobs |
| 15 | GPU Jobs | --gres=gpu, GPU types, multi-GPU, CUDA_VISIBLE_DEVICES |
| 16 | Recurring Jobs (scrontab) | Slurm's built-in cron for scheduled cluster jobs |
After this track: You can build complex multi-step pipelines, run MPI and GPU workloads, and automate recurring analysis.
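Modules 12 and 13 combine naturally: an array fans work out, and a dependency fans it back in. A sketch, assuming a hypothetical `samples.txt` manifest with one input per line:

```shell
#!/bin/bash
#SBATCH --array=1-100%10          # tasks 1-100, at most 10 running at once
#SBATCH --output=task-%A_%a.out   # %A = array job ID, %a = array task ID

# Each task selects the line of samples.txt matching its own task ID
sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "Processing $sample"
```

To build the pipeline, capture the array's job ID with `jid=$(sbatch --parsable array.sh)` and submit the follow-up step with `sbatch --dependency=afterok:$jid summarize.sh`; the `afterok` condition runs it only if every array task exits successfully.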
Track 4: Administrator (L4)¶
For cluster administrators. This is the most comprehensive track, covering Slurm installation through production operations.
Core Admin Modules¶
| # | Module | What You'll Learn |
|---|---|---|
| 1 | Slurm Architecture | Daemons, communication model, authentication options |
| 2 | Installation | Prerequisites, authentication (MUNGE or auth/slurm), packages, MariaDB, slurmdbd |
| 3 | Configuration | slurm.conf, slurmdbd.conf, cgroup.conf, node definitions |
| 4 | Partitions & QOS | Partition design, QOS limits, preemption, access control |
| 5 | Accounts & Fairshare | sacctmgr, account hierarchy, fairshare algorithm, shares |
| 6 | Resource Management | GRES (GPUs, licenses), consumable resources, cgroup enforcement |
| 7 | Monitoring & Accounting | sacct, sreport, sdiag, Prometheus/Grafana integration |
| 8 | Policies & Priority | Multifactor priority, preemption, backfill, reservations |
| 9 | Troubleshooting | Pending job diagnosis, node states, log files, common problems |
| 10 | Maintenance & Operations | Draining, upgrades, backups, config versioning |
| 11 | High Availability | slurmctld failover, slurmdbd HA, state directory, testing |
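To make the node and partition material in modules 3-4 concrete, configuration entries have this general flavor. The hostnames, counts, and limits below are purely illustrative; consult the slurm.conf documentation for your version before copying anything:

```
# slurm.conf fragment (illustrative values)
NodeName=node[01-04] CPUs=64 RealMemory=256000 Gres=gpu:4 State=UNKNOWN
PartitionName=batch Nodes=node[01-04] Default=YES MaxTime=7-00:00:00 State=UP
```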
Deployment Modules (choose your platform)¶
| Module | When to Read |
|---|---|
| On-Premise Deployment | Deploying Slurm on bare-metal or VM infrastructure |
| AWS ParallelCluster | Self-managed Slurm on AWS with dynamic scaling |
| AWS PCS | AWS-managed Slurm (minimal admin overhead) |
Read the deployment module(s) that match your environment. Most admins should also skim the other deployment modules for cross-platform awareness.
Quick Reference¶
| Reference | Use For |
|---|---|
| Command Cheatsheet | Quick command lookup |
| Environment Variables | SLURM_* variables in job scripts |
| Job State Codes | Decoding job states (PD, R, CG, F, etc.) |
| Node State Codes | Decoding node states (idle, alloc, drain, down) |
Recommendation: Admins should also complete Tracks 1-3 (the user modules) to understand the end-user experience on the cluster they manage.
Track 5: IT Leadership (L5)¶
For managers, directors, and decision-makers evaluating or overseeing HPC infrastructure. No command-line prerequisites.
| # | Module | What You'll Learn |
|---|---|---|
| 1 | What is HPC Scheduling? | Why shared computing needs a scheduler (non-technical overview) |
| 2 | Why Slurm | Market position, GPU support, cloud integration, cost model, talent pool |
| 3 | Slurm Architecture | Technical architecture at a glance (skim for context) |
| 4 | Capacity Planning | Utilization metrics, sreport, planning strategies, cloud bursting economics |
| 5 | Cost Allocation | Chargeback/showback, account hierarchy, QOS budgets, cloud cost attribution |
| 6 | Accounts & Fairshare | How resource allocation and fairshare work (skim for policy context) |
| 7 | Policies & Priority | Scheduling policies, preemption, reservations (skim for policy context) |
After this track: You can make informed decisions about HPC infrastructure investments, evaluate Slurm vs. alternatives, and understand the reporting tools available for cost management.
Application Tracks¶
These tracks are for organizations running specific life science applications on Slurm. Each has an admin module (setup/configuration) and a user module (daily use).
Schrodinger Suite + Slurm¶
| # | Module | Audience | What You'll Learn |
|---|---|---|---|
| 1 | Schrodinger Admin Setup | L4 | SLM licensing, hosts file/hosts.yml, GPU config, Job Server, license-aware scheduling |
| 2 | Schrodinger User Guide | L2-L3 | -HOST flag, -NPROC, Maestro submission, monitoring, troubleshooting |
Prerequisites: Tracks 1-2 (user basics) for the user module; Track 4 (admin) for the admin module.
CryoSPARC + Slurm¶
| # | Module | Audience | What You'll Learn |
|---|---|---|---|
| 1 | CryoSPARC Admin Setup | L4 | Cluster lanes, cluster_info.json, cluster_script.sh, GPU management, CryoSPARC Live |
| 2 | CryoSPARC User Guide | L2-L3 | Lane selection, resource settings, monitoring, failure diagnosis, SSD caching |
Prerequisites: Tracks 1-2 (user basics) for the user module; Track 4 (admin, especially GPU and GRES) for the admin module.
Migration Tracks¶
For users transitioning from another scheduler to Slurm. Each guide provides command mapping tables, job script translations, and behavioral differences.
| Source Scheduler | Module | Focus |
|---|---|---|
| Sun Grid Engine (SGE) | SGE to Slurm | qsub/sbatch, PE/--ntasks, queues/partitions |
| PBS/Torque | PBS to Slurm | qsub/sbatch, PBS directives to #SBATCH |
| IBM LSF | LSF to Slurm | bsub/sbatch, LSF queues to Slurm partitions |
Best approach: Read your migration guide first, then work through Tracks 1-2 (or Track 3 if you were a power user on the old scheduler).
Deployment Overlays¶
Every module in the training set covers concepts that apply across all deployment platforms. Deployment-specific differences are highlighted in callout blocks throughout the modules:
- ParallelCluster Note: AWS ParallelCluster-specific behavior or configuration
- PCS Note: AWS PCS-specific behavior or configuration
- On-Prem Note: On-premise-specific considerations
For deep dives into deployment-specific topics, see:
| Module | Key Topics |
|---|---|
| On-Premise Deployment | Infrastructure, networking, storage, identity, configless mode, large cluster tuning |
| AWS ParallelCluster | YAML config, static/dynamic nodes, FSx/EFS, EFA, Spot instances, cost management |
| AWS PCS | Managed Slurm, cluster sizing, custom settings, multi-cluster sackd, accounting |
Suggested Learning Paths by Role¶
Structural Biologist (cryo-EM)¶
- Track 1 (Getting Started)
- Track 2 modules 7-8 (Resources, Interactive Jobs)
- GPU Jobs
- CryoSPARC User Guide
Computational Chemist (drug discovery)¶
- Track 1 (Getting Started)
- Track 2 (End-User Essentials)
- GPU Jobs
- Schrodinger User Guide
Bioinformatician (genomics pipelines)¶
- Track 1 (Getting Started)
- Track 2 (End-User Essentials)
- Job Arrays + Job Dependencies
- Containers on Slurm
- Best Practices
HPC System Administrator (new to Slurm)¶
- Tracks 1-3 (all user modules, to understand the user experience)
- Track 4 (all admin modules)
- Deployment module for your platform
- Application modules for your site's software
IT Director evaluating Slurm¶
- Track 5 (IT Leadership)
- Skim deployment modules for your target platform
Migrating from SGE/PBS/LSF¶
- The migration guide for your old scheduler
- Tracks 1-2 (build Slurm muscle memory)
- Track 3 if you were a power user
- Track 4 if you were an admin