| Term | Description |
|---|---|
| partition | Collection of computers (nodes). Usually grouped by similar architectural properties (cpus/gpus). |
| account | Collection of users. Used for permitting access to parts of the system. |
| node | A computer in the cluster. Physical hardware at the data center. |
| job | Collection of steps, often just a configuration step and an execution step that runs on the cluster. |
| step | A subdivision of a job: a set of tasks that are executed together, in parallel or in series. |
| task | The smallest unit of execution in Slurm. Tasks are typically associated with a specific number of CPU cores. |
| cpu | A physical processor capable of executing a task. Some tools refer to this as a core or a thread; read the documentation of your tool to determine the specifics. |
| gpu | A powerful processor optimized for floating-point operations, typically useful for machine learning and other specialized pipelines/algorithms. Be sure to read the documentation of the tools you are using before requesting one of these limited resources. |
| mem | Memory (RAM). Used to allocate working resources for your task. |
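To see how these terms fit together, here is a minimal sketch of a batch script: the script as a whole is a job, each `srun` line inside it launches a step, and the `--ntasks`, `--cpus-per-task`, and `--mem` options size the work. The partition and account names are placeholders; substitute the ones you were assigned.

```shell
#!/bin/bash
#SBATCH --job-name=demo        # job: the whole script is one job
#SBATCH --partition=general    # partition: a group of similar nodes (placeholder name)
#SBATCH --account=my_lab       # account: grants access to compute resources (placeholder name)
#SBATCH --ntasks=2             # tasks: the smallest units of execution
#SBATCH --cpus-per-task=4      # cpus: cores allocated to each task
#SBATCH --mem=8G               # mem: RAM allocated to the job

srun hostname                  # step 1: runs on the allocated node(s)
srun ./my_analysis             # step 2: another step within the same job (hypothetical program)
```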
# 10 The Slurm Scheduler
## 10.1 Access
See Chapter 1 for instructions on getting access to the HPC. You will also want to request an association in order to get access to storage space and compute resources.
It is important to understand that the node you connect to when you first log in to the cluster is just a login node.
When running compute-intensive tasks on the Sasquatch HPC, it is crucial to use the worker nodes instead of the login node. Always launch your jobs using `srun` for interactive sessions or `sbatch` for batch scripts.

The login node is a shared resource that all users rely on to manage files, compile code, and submit jobs. If you run resource-intensive work on it, you can slow down or freeze the system, preventing others from performing essential tasks. In such cases, an administrator may terminate your processes to restore normal operations.
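As a sketch of the two submission styles, the commands below request a small interactive allocation and submit a batch script. The partition name and script name are placeholders; run `sinfo` to see the partitions actually available on your cluster.

```shell
# Interactive session: request 1 task with 4 CPUs and 8 GB of memory
# for one hour on a hypothetical partition named "general", then open a shell.
srun --partition=general --ntasks=1 --cpus-per-task=4 --mem=8G \
     --time=01:00:00 --pty bash

# Batch submission: hand a script (placeholder name) to the scheduler
# and return immediately; output is written to a slurm-<jobid>.out file.
sbatch my_job.sh
```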
## 10.2 Using Slurm on Sasquatch
- Submitting jobs (Chapter 11)
- Monitoring jobs (Chapter 12)
- Logging (Chapter 13)
- Scaling jobs (Chapter 14)
## 10.3 Common Slurm Terms
## 10.4 HPC Architecture
The overall architecture diagram of Sasquatch can be found in Section 5.2.
Information on the different clusters for Posit can be found in Chapter 22.
## 10.5 Learning resources: SLURM
- Workload Manager Rosetta Stone - a useful resource for users new to SLURM but familiar with other job schedulers (e.g., PBS).
- SLURM cheatsheet - a quick guide for reference.
- SLURM commands and options - man pages for all Slurm commands, including `srun`, `sbatch`, `sacct`, etc.
- Additional SLURM documentation - more complete documentation on features and applications (from SchedMD).