Job Submission with Slurm

This page can be used as an introductory guide to job submission or a Slurm command reference. Each section can be read sequentially if you are new to Slurm, or referenced individually if you are already familiar with job scheduling.

Introduction

Slurm is a job scheduler and resource manager for computer clusters, similar to both Moab and Torque. If you are already familiar with the Moab msub or Torque qsub command, adapting to Slurm is very straightforward. A command reference for Slurm and PBS/Torque is available here to assist with translating existing job scripts.

Job Submission

Job submission in Slurm is split between two commands: srun and sbatch. sbatch is used to submit job scripts to the scheduler, while srun is used to run programs directly from the command line as jobs. Both commands share the same command-line arguments.

The rest of this page contains examples explaining how to run various types of jobs using srun and sbatch. Reference sections for commonly used environment variables, job parameters, and filename patterns are available at the bottom of the page.

PBS Job Script Translation


Slurm provides its own qsub command, which attempts to seamlessly convert PBS job submission scripts to SBATCH scripts. This is the fastest way to test your existing job scripts against the Slurm scheduler with minimal changes. There are a few differences in how the Slurm scheduler and Moab scheduler are configured, however, which require slight modifications to existing PBS scripts:

In addition to these four changes, any references to environment variables provided by Moab/Torque must be updated to use the Slurm equivalent environment variables. A list of common variables is available in the Environment Variables section below. For the full variable listing, view the official documentation here.

Common Translation Errors

1. PBS_O_WORKDIR must change to SLURM_SUBMIT_DIR

If you have been using PBS_O_WORKDIR in your batch script to set your file path, Slurm will not recognize it and your job will complete without output. Use SLURM_SUBMIT_DIR, the Slurm equivalent of PBS_O_WORKDIR, instead.
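
For example, a job script that changes to the submission directory would be updated as follows (a minimal before-and-after sketch):

# PBS/Torque version
cd $PBS_O_WORKDIR

# Slurm version
cd $SLURM_SUBMIT_DIR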

2. Replacing mpirun with srun for MPI codes

CADES MPI modules and programming environments have been recompiled with Slurm support, allowing the srun command to be used in place of mpirun. We encourage everyone to use srun instead of mpirun or mpiexec because it is now the better supported method of launching MPI job processes. If you encounter issues updating job scripts to use srun, the mpirun binaries are still provided and can be used instead. See the OpenMPI Slurm FAQ here for more information.
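
A minimal sketch of the change inside a job script (my_mpi_program is a placeholder for your own executable):

# Before (Moab/Torque)
mpirun -np 4 ./my_mpi_program

# After (Slurm) -- srun reads the task count from the job allocation
srun ./my_mpi_program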

📝 Note: If you are submitting your job to CADES Condo resources, view the page on Slurm Resource Queues for more information on accounts and resource queues.

Srun


Getting Started

The example below shows how to run a program directly from the command line as a job using srun. This method of job submission is useful for quick one-off jobs like the one shown, but generally should not be used for complex jobs with extensive resource requirements.

HelloWorld.sh

#!/bin/bash

echo "Hello World!"

To submit the example HelloWorld.sh script as a job, run the following srun command:

srun -A <account_name> -p <partition_name> -N 1 -n 1 -c 1 --mem=8G -t 10:00 ./HelloWorld.sh

Each argument used in this example is detailed below:

Arguments

 -A : Account to run the job under
 -p : Partition to run the job in
 -N : Nodes requested for the job
 -n : Tasks the job will run
 -c : CPU cores that each task requires
 --mem : Required memory per node
 -t : The requested walltime for the job

The last argument in the command lists the path and name of the program to run, which in this case is the example script located in the current directory.

If you are unsure of what account to specify for the -A argument, use the following command to list all of the accounts you are a member of:

sacctmgr show assoc where user=<uid> format=account

To find valid values for the -p argument, use the sinfo command to list all of the available partitions in the cluster.

Interactive Jobs

Interactive jobs give you direct shell access to compute node hardware. This is typically used when testing or profiling a job, or when running a program that requires user input.

srun can be used to start an interactive job with one additional argument: --pty. An example command is shown below:

srun -A <account_name> -p <partition_name> -N 1 -n 1 -c 1 --mem=8G -t 1:00:00 --pty /bin/bash

In this example the --pty argument was added, and the final argument specifying the program to run was changed from the test script to the bash shell. A brief example session is shown after the argument list below.

New Arguments

 --pty : Run the program listed as the last argument in pseudo-terminal mode
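
Once the interactive shell starts on a compute node, commands run there directly. A short illustrative session is shown below (the PE-gnu module name is borrowed from the MPI example later on this page; substitute whatever modules your work needs):

hostname             # confirm the shell is running on a compute node
module load PE-gnu   # load a programming environment
./HelloWorld.sh      # run the test script interactively
exit                 # end the interactive job and release the allocation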

Sbatch


Non-interactive Jobs

The sbatch command parses properly formatted script files and runs them as jobs. Job scripts consist of three main components: the interpreter declaration, the sbatch arguments, and the job commands.

HelloWorld.sbatch

#!/bin/bash
# Interpreter declaration

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -J test-job
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./test-output.txt
#SBATCH -e ./test-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
# sbatch arguments

./HelloWorld.sh
# Job commands -- this is the same HelloWorld.sh script used in the first example

This example is similar to the srun command-line example above, but uses several more arguments to specify additional resources and constraints:

New Arguments

 -J          : Job name
 --mem       : Required memory per node. When this is set to zero, all available memory is requested
 -o          : File to redirect standard output to
 -e          : File to redirect standard error to
 --mail-type : Events to send email notifications for (see the example after this list)
 --mail-user : Email address to send job notifications
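
--mail-type accepts several event names, such as BEGIN, END, FAIL, and ALL. For example, a job that should send email only on completion or failure could use the following pair of lines in its script:

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<your_email>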

Submit the job script

Another advantage of the job script over the command-line srun example at the top of this page is that it is much easier to submit.

To submit your job script to the compute nodes, issue:

sbatch HelloWorld.sbatch

Example Output

Hello World!

Parallel Program Execution

The previous example demonstrated how to run a job on one core and one node. This example extends the previous one to run a program across multiple cores and multiple nodes. A few of the environment variables Slurm provides are also demonstrated.

HelloWorld.sh

#!/bin/bash

echo "Hello World! Node:${SLURMD_NODENAME} Core:${SLURM_PROCID}"

Two Slurm-controlled environment variables have been added to the HelloWorld.sh echo command. These variables will display the name of the node each job task is running on as well as the task number.

Multithread-HelloWorld.sbatch

#!/bin/bash

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH -J multithread-test-job
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./%j-multithread-output.txt
#SBATCH -e ./%j-multithread-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>

srun ./HelloWorld.sh

There are four major differences between this job script and the one in the previous example:

The -N and -n parameters were changed to request two nodes and four tasks respectively. Because the cores-per-task option (-c) is still set to one, this script requests four CPU cores spread across two nodes.

The --ntasks-per-node option ensures that two of the four requested tasks run on each node. Without this option, the scheduler may place three tasks on one node and one on another, because the nodes have more cores than the job script requests.

The %j added to the output and error file names is a special string called a "filename pattern" that is recognized by Slurm. When the job is submitted, the %j will be replaced with the job ID number.

The addition of the srun command before the script file tells Slurm to run the script once for each requested job task. Launching the program with srun also causes the scheduler to set the task-specific Slurm environment variables, such as SLURM_PROCID, for each task.

New Arguments

 --ntasks-per-node : Number of tasks to run per node

Environment Variables

 SLURMD_NODENAME : The name of the node that the current job task is running on
 SLURM_PROCID    : The ID number of the current job task running on the node

Filename Patterns

 %j : Job ID

Submit the job script

To submit your job script to the compute nodes, issue:

sbatch Multithread-HelloWorld.sbatch

Example Output

Hello World! Node:or-slurm-c01 Core:2
Hello World! Node:or-slurm-c00 Core:0
Hello World! Node:or-slurm-c01 Core:3
Hello World! Node:or-slurm-c00 Core:1

Job Arrays

Job arrays provide a simple way of running multiple instances of a job with different data sets. This example modifies the previous example to run two instances of the same job. Additional job array specific environment variables are also demonstrated.

HelloWorld.sh

#!/bin/bash

echo "Hello World! Node:${SLURMD_NODENAME} Core:${SLURM_PROCID} Array:${SLURM_ARRAY_JOB_ID} Task:${SLURM_ARRAY_TASK_ID}"

Two additional environment variables were added to the end of the echo command; the first prints the primary job array ID, and the second prints the secondary ID for each task in the array.

Array-HelloWorld.sbatch

#!/bin/bash

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH -J array-test-job
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./%A-%a-output.txt
#SBATCH -e ./%A-%a-multithread-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
#SBATCH -a 0-1%2
#SBATCH --exclusive

srun ./HelloWorld.sh

There are three major differences between this job and the previous one: the addition of the -a parameter, the --exclusive parameter, and the %A and %a filename patterns. -a specifies the range of job IDs to run in the array, and can optionally include a limit on the number of jobs to run at once. This example specifies two jobs in the array with the 0-1 range parameter, indicating that one job should run with ID 0 and another with ID 1. The %2 at the end specifies that up to two jobs in the array may run at once.

The --exclusive option tells the scheduler to give each job in the array exclusive use of its own nodes. Without this option, Slurm will pack the jobs in an array onto as few nodes as possible, so omit the --exclusive flag if you want multiple array jobs to share a node.

The %A filename pattern is replaced by the primary job ID when the job is submitted, and the %a pattern is replaced by the secondary job array task ID. This will cause each job in the array to create its own output and error files.

New Arguments

 -a          : Create a job array with the specified range of job IDs
 --exclusive : Run each job in a job array exclusively on its own nodes

New Environment Variables

 SLURM_ARRAY_JOB_ID  : The primary ID of the job array
 SLURM_ARRAY_TASK_ID : The secondary ID of the running task in the job array. It can be used
                       inside the job script to handle input and output files for that task.
                       For instance, for a 3-task job array, the input files can be named input_1,
                       input_2, and input_3, and referred to as input_${SLURM_ARRAY_TASK_ID} in the
                       job script. Output files can be handled in the same way; a short sketch of
                       this pattern is shown after this list.
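
A minimal sketch of that pattern, assuming input files named input_0 and input_1 sit next to the job script and my_program stands in for your own executable:

#!/bin/bash
# Pick this array task's input and output files based on its task ID
INPUT_FILE="input_${SLURM_ARRAY_TASK_ID}"
OUTPUT_FILE="output_${SLURM_ARRAY_TASK_ID}"

./my_program "${INPUT_FILE}" > "${OUTPUT_FILE}"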

New Filename Patterns

 %A : Job array primary job ID
 %a : Job array task ID

Submit the job script

To submit your job script to the compute nodes, issue:

sbatch Array-HelloWorld.sbatch

Example Output

Hello World! Node:or-slurm-c00 Core:0 Array:86 Task:0
Hello World! Node:or-slurm-c00 Core:1 Array:86 Task:0
Hello World! Node:or-slurm-c01 Core:2 Array:86 Task:0
Hello World! Node:or-slurm-c01 Core:3 Array:86 Task:0
Hello World! Node:or-slurm-c02 Core:1 Array:86 Task:1
Hello World! Node:or-slurm-c02 Core:0 Array:86 Task:1
Hello World! Node:or-slurm-c03 Core:2 Array:86 Task:1
Hello World! Node:or-slurm-c03 Core:3 Array:86 Task:1

MPI

This example demonstrates how to run MPI programs with the Slurm scheduler. The module commands in mpi-ring.sbatch may be unnecessary depending on how MPI is installed on the system you are running on. If you are running on the CADES Condos, the job script below can be copied verbatim.

mpi-ring.c

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  int world_rank, world_size, token;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  if (world_rank != 0) {
    MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n", world_rank, token, world_rank - 1);
  } else {
    token = -1;
  }
  MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
  if (world_rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n", world_rank, token, world_size - 1);
  }
  MPI_Finalize();
}
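
Before submitting the job, compile the example into the mpi_ring binary that the job script below runs. Assuming the PE-gnu module provides the mpicc compiler wrapper, a typical build looks like this:

module purge
module load PE-gnu
mpicc mpi-ring.c -o mpi_ring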

The MPI program used in this example was copied from Wes Kendall's MPI Tutorial website. A detailed explanation of the example program is available here:
"MPI Send and Receive"

mpi-ring.sbatch

#!/bin/bash

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH -J mpi-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH -o ./%j-mpi-output.txt
#SBATCH -e ./%j-mpi-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>

module purge
module load PE-gnu
srun ./mpi_ring

The only new additions to the sbatch script in this example are the job commands at the bottom. The module commands load MPI, then the mpi_ring program is run over InfiniBand. Because the MPI and programming environment (PE) modules in CADES have been compiled with Slurm support, srun can be used in place of mpirun. We encourage everyone to use srun instead of mpirun or mpiexec because it is now the better supported method of launching MPI job processes. If you encounter issues using srun, the mpirun binaries are still provided and can be used instead.

Regardless of whether you are using srun or mpirun, the Slurm scheduler automatically passes the node list and process count to both commands, so they do not have to be specified manually. These options can still be given explicitly to override the defaults, as shown below.
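
For example, to launch the program on only two of the four allocated tasks, the task count can be passed to srun explicitly (a small illustrative override):

srun -n 2 ./mpi_ring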

Submit the job script

To submit your job script to the compute nodes, issue:

sbatch mpi-ring.sbatch

Example Output

Process 1 received token -1 from process 0
Process 2 received token -1 from process 1
Process 3 received token -1 from process 2
Process 0 received token -1 from process 3

GPUs

This example shows how to specify GPUs in an sbatch script.

HelloWorldGPU.sh

#!/bin/bash

#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -G 4
#SBATCH -J gpu-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH -o ./%j-gpu-output.txt
#SBATCH -e ./%j-gpu-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>

nvidia-smi

The -G option above specifies the number of GPUs your job needs. You must specify at least -G 1 to gain access to a GPU. You can also add an optional specifier to indicate the type of GPU you need; for example, requesting four K80 GPUs takes the form -G k80:4. For a full list of the types of GPUs available, reference the resource queues page here, or run the command sinfo -O gres on a login node.
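
For instance, replacing the -G 4 line in the script above with a typed request for two K80 GPUs (the same form used in the Containers example later on this page) would look like this:

#SBATCH -G k80:2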

New Arguments

 -G : Reserve nodes with the specified GPU resources

Submit the job script

To submit your job script to the compute nodes, issue:

sbatch HelloWorldGPU.sh

Example Output

Mon Jul 29 13:32:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   35C    P0    61W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P0    73W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:83:00.0 Off |                    0 |
| N/A   37C    P0    68W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   30C    P0    81W / 149W |      0MiB / 11441MiB |     63%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Containers

Singularity containers are a simple way to run a containerized workflow in an HPC environment. This example demonstrates how to create a Singularity container from a Docker Hub image and run it using the Slurm scheduler.

As a first step, check that Singularity is installed by running singularity --help. If the singularity command is not found, you will first need to load a Singularity software module, or install Singularity yourself. If Singularity is present, start by creating a new container with the following command:

singularity build cuda-test.sif docker://nvidia/cuda:10.1-base-ubuntu16.04

This should create a file called cuda-test.sif in the current directory. Now that the test container has been created, we can run it using a Slurm job script. Two versions of the same job script are shown below, one for running on the CADES Condos, and one for running on the DGX systems.
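
Optionally, the new image can be sanity-checked on a login node before submitting a job. This quick test (not required) simply prints the operating system information from inside the container:

singularity exec cuda-test.sif cat /etc/os-release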

CADESCondosingularity.sbatch

#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -J singularity-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH -G k80:2

srun singularity exec --nv ./cuda-test.sif nvidia-smi

Submit the job script

To submit your job script to the compute nodes, issue:

sbatch CADESCondosingularity.sbatch

DGXsingularity.sbatch

#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -J singularity-test-job
#SBATCH --mem=1G
#SBATCH -t 10:00
#SBATCH --gres=gpu:2

srun singularity exec ./cuda-test.sif nvidia-smi

Submit the job script

To submit your job script to the compute nodes, issue:

sbatch DGXsingularity.sbatch

These two job scripts pass different arguments to singularity exec because different Singularity versions are installed on the DGX systems and the CADES Condos. They also use different SBATCH parameters to request GPU resources because the two systems run different Slurm versions.

Example Output

Fri Jul 26 16:36:49 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   27C    P0    52W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   26C    P0    49W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Monitoring the Queue

Below is a table of common Slurm commands and their Moab/Torque equivalents. Example runs for each command are also shown.

 Moab                 Slurm     Usage
 qsub                 sbatch    Job script submission. Examples shown above.
 qsub, pbsdsh         srun      Interactive job submission and running parallel processes. Examples shown above.
 qstat, showq         squeue    View the state of all jobs in the cluster.
 checknode, showbf    sinfo     View information on queues and nodes in the cluster.
 checkjob, mschedctl  scontrol  View detailed information on cluster jobs, accounts, queues, etc.
 canceljob            scancel   Cancel a running or pending job.

Slurm provides multiple tools to view queue, system, and job status. Below are the most common and useful of these tools.

To see all jobs currently in the queue:

$ squeue

To see the full output of all your queued jobs:

$ squeue -l -u <uid>

Example:

$ squeue -l -u <uid>
Wed Aug 21 16:44:37 2019
             JOBID PARTITION     NAME      USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
                11     batch mpi_hello    <uid> COMPLETI       0:01      1:00      2 or-slurm-c[00-01]

To see how many nodes are available in each queue:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
testing*     up    4:00:00      6   idle or-slurm-c[00-03],or-condo-g[05-06]

To check your job's status:

$ scontrol show job <job_id>

📝 Note: If the reason your job is not running is "MaxCpuPerAccount", all of the nodes in your access group are being used by other users. For example, if you are a birthright user, you will see this reason when all 36 nodes in the birthright condo are in use by other users at the time you submit your job.

To cancel your job:

$ scancel <job_id>
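
scancel can also target jobs by user rather than by job ID. To cancel all of your own queued and running jobs at once (use with care):

$ scancel -u <uid>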

Why Isn't My Job Running?

Below are explanations of the reasons given by squeue for why your job is still queued:

Priority - Your job is waiting on higher priority jobs to complete.

Dependency - Your job is waiting on a dependent job to complete before it can start.

ReqNodeNotAvail - One or more nodes required by your job are unavailable (reserved, down, being used by other users). Your job will start running when these nodes become available.

Resources - Your job is waiting for resources to become available.

MaxCpuPerAccount, MaxNodePerAccount - All of your group's nodes in the queue are currently in use by other users. Your job will run when nodes become available.

AssocGrpCPURunMinutesLimit - Your job has hit a limit assigned by your condo owner that prevents a single user from using too large a portion of the system at any one time. This state is temporary and will resolve as resources become available.

See the full list of reason codes in the official documentation here.

Appendix

Below are references for all of the Slurm environment variables, job script parameters, and filename patterns used in the examples above. For full reference pages, see the official documentation linked in each sub-section.

Environment Variables

Frequently used environment variables provided by Slurm are detailed below. To view a complete list of all Slurm variables, check the official documentation here.

Job Parameters

Frequently used sbatch and srun arguments are detailed below. To view a complete list of all arguments, check the official documentation here.

Filename Patterns

Frequently used filename pattern strings are detailed below. To view a complete list of all available patterns, check the official documentation here.