Slurm

When you connect to NIC5, you land on the login node (nic5-login1). The login node is shared by all users connected to NIC5. It is the entry point to the cluster and is not intended to run resource-intensive calculations. Calculations requiring significant resources should be run on the compute nodes.

On NIC5 (and almost all HPC clusters), the allocation of the compute nodes' resources is organized by a piece of software called a resource manager or job scheduler. Users submit jobs, which are then run unattended at a time, and on resources, decided by the scheduling algorithm. In the case of NIC5, the resource manager and job scheduler is Slurm.

Gathering information about a cluster

When submitting a job to an HPC cluster, you need some information about its organization and hardware resources. For example:

  • How many compute nodes are available?
  • How many cores are available on each compute node?
  • How much memory does each compute node have?
  • What is the maximum time a job can run on a compute node?

Gathering information about a cluster managed by Slurm is done using the sinfo command. This command will print information about the compute nodes and their state:

 $ sinfo
PARTITION AVAIL   TIMELIMIT  NODES  STATE NODELIST
batch*       up  2-00:00:00     12    mix nic5-w[030,041,043-046,048-049,051-052,065,069]
batch*       up  2-00:00:00     58  alloc nic5-w[001-029,031-040,042,047,050,053-064,066-068,070]
hmem         up  2-00:00:00      3   idle nic5-w[071-073]
bio          up 62-00:00:00      1   idle nic5-w074
  • The PARTITION column indicates the partition the compute nodes belong to. Partitions can be thought of as job queues, each with its own set of constraints, such as hardware type, job size limit and job time limit. NIC5 is organized in three partitions:

    • batch, the default partition (indicated by the * next to the partition name), contains all the nodes with 256 GB of memory.
    • hmem groups all the high-memory nodes (1 TB).
    • bio contains a private compute node reserved for a particular research group.
  • The AVAIL column refers to the availability of the partition. A partition in the up state is available.

  • The TIMELIMIT column gives you the maximum time a job can run for a particular partition. The format is DD-HH:MM:SS. A time limit of 2-00:00:00 means that the maximum time a job can run is 2 days.
  • The NODES column indicates the number of nodes in a particular state for a given partition.
  • The STATE column gives you the state of these nodes:

    • alloc means that the nodes are fully allocated: all CPU cores on these nodes are used by running jobs.
    • mix means that the nodes are partially allocated: some cores of these nodes are free.
    • idle means that there are no jobs running on the nodes: all cores are free.
  • The NODELIST column lists the nodes in a given state and partition.
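
If you are only interested in a single partition, you can restrict the output to it with the -p option (the node states shown will, of course, depend on the current cluster load):

 $ sinfo -p hmem
PARTITION AVAIL   TIMELIMIT  NODES  STATE NODELIST
hmem         up  2-00:00:00      3   idle nic5-w[071-073]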

A more concise output can be obtained using the -s option.

 $ sinfo -s
PARTITION AVAIL   TIMELIMIT   NODES(A/I/O/T) NODELIST
batch*       up  2-00:00:00        70/0/0/70 nic5-w[001-070]
hmem         up  2-00:00:00          0/3/0/3 nic5-w[071-073]
bio          up 62-00:00:00          0/1/0/1 nic5-w074

With the -s option, only one line is printed for each partition, with the NODES(A/I/O/T) column presenting the number of nodes in each state.

  • A gives the number of allocated nodes, which in this case means nodes with at least one core allocated to a job.
  • I gives the number of idle nodes.
  • O gives the number of nodes in a state other than A or I. Usually, these are nodes that are down or unavailable because something is wrong with them, either on the software or the hardware side.
  • T gives the total number of nodes, regardless of their state.

If you want to gather information about the maximum number of CPU cores and memory available, you can use a custom output format:

 $ sinfo --format="%10P %.5a %.11l %.6D %.4c %.10m"
PARTITION  AVAIL   TIMELIMIT  NODES CPUS     MEMORY
batch*        up  2-00:00:00     70   64     257700
hmem          up  2-00:00:00      3   64    1031900
bio           up 62-00:00:00      1  256    2064000
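
In this format string, each % specifier selects a column: %P is the partition name, %a the availability, %l the time limit, %D the number of nodes, %c the number of CPUs per node and %m the memory per node in megabytes. A number between % and the letter sets the field width, and a leading dot right-justifies the field. Other node properties can be printed in the same way; for example, %z should add the socket:core:thread layout of the nodes (all specifiers are listed in the sinfo manual page):

sinfo --format="%10P %.6D %.10z"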
An alternative is to use the node-centric output with the --Node and --long options.

 $ sinfo --Node --long
Fri Sep 29 17:52:34 2023
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
nic5-w001      1    batch*   allocated 64     2:32:1 257700        0      1 amd,rome none
nic5-w002      1    batch*   allocated 64     2:32:1 257700        0      1 amd,rome none
nic5-w003      1    batch*   allocated 64     2:32:1 257700        0      1 amd,rome none

...

nic5-w072      1      hmem   allocated 64     2:32:1 1031900        0      1 amd,rome none
nic5-w073      1      hmem        idle 64     2:32:1 1031900        0      1 amd,rome none
nic5-w074      1       bio        idle 256    2:64:2 2064000        0      1 amd,rome none
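
In this output, the S:C:T column describes the topology of each node as sockets:cores:threads: 2:32:1 means two sockets of 32 cores each with one hardware thread per core (64 CPUs in total), while 2:64:2 means two sockets of 64 cores each with two hardware threads per core (256 CPUs).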

Submitting a job

A job is composed of two components.

  • A resource component: the number of cores, the number of nodes, the memory, the maximum time during which the job needs the resources...
  • A compute component: the setup of the environment in which the application runs and the command(s) to execute.

A job is specified by a batch script, stored as a text file, which includes specifications for the required resources and the command or commands to be executed. This script is then submitted to the Slurm scheduler, which examines the resource requirements in the batch script and verifies the availability of the resources. If the requested resources are available and your job has a sufficiently high priority, it will begin execution immediately. Otherwise, your job will be placed in a queue and will remain there until the required resources become available.

Your first job batch script

To illustrate the process of a job submission, let's consider the following job batch script:

Source code for this example

firstjob_submit.sh
#!/bin/bash
#SBATCH --job-name="My first job"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=firstjob.out

echo "Hello! I'm job with ID ${SLURM_JOBID}."
echo "I'm running on compute node(s) ${SLURM_JOB_NODELIST}."

To submit this job, we need to put the content of the code block above in a text file. We will name this file firstjob_submit.sh and create it with the simple command-line text editor nano by running the command

nano firstjob_submit.sh

Then, paste the content of the batch script. To exit nano, press Ctrl+X, then press Y to confirm that you want to save the changes and, finally, press Enter to confirm the name of the file to write.
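
If you prefer not to use an interactive editor, the same file can be created directly from the shell with a here-document. Note the quoted 'EOF', which prevents your login shell from expanding the ${SLURM_JOBID} and ${SLURM_JOB_NODELIST} variables before the file is written:

cat > firstjob_submit.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name="My first job"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00
#SBATCH --output=firstjob.out

echo "Hello! I'm job with ID ${SLURM_JOBID}."
echo "I'm running on compute node(s) ${SLURM_JOB_NODELIST}."
EOF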

Submit your first job

Now that our job batch script is ready, we can submit it using the sbatch command

sbatch firstjob_submit.sh

If everything works as intended, the sbatch command should produce an output similar to

Submitted batch job 5971751

The number at the end of the output is called a job ID and is a unique identifier assigned by Slurm to every submitted job. This identifier can be used later to alter, cancel or get information about a job.
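
For example (all three are standard Slurm commands):

scontrol show job 5971751   # print detailed information about the job
squeue -j 5971751           # show the job's current state in the queue
scancel 5971751             # cancel the job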

Common error

We often see users submitting jobs with bash JOBSCRIPT instead of sbatch JOBSCRIPT. While the two commands look similar, the first one runs your job script on the login node, resulting in poor performance.

After submitting your job, you should see the sbatch output after a few seconds

Submitted batch job <ID>

If not, you probably used bash instead of sbatch: you will see either no output at all or the output of the commands run in your batch script. In that case, immediately terminate the process by pressing Ctrl+C.

A closer look at the script

Now that we have submitted our first job, let's take a moment to look at the batch script in detail. The first line is called a shebang line and is commonly used on UNIX-like systems to indicate which interpreter to use. A shebang line is always the first line of a script and starts with #! followed by the name of, or path to, the interpreter executable.

#!/bin/bash

In our case, we want to use bash as the scripting language, so we specify bash as the interpreter in the shebang line.

Shebang is mandatory

The shebang line is mandatory. If you omit it, sbatch will fail with the following error message

sbatch: error: This does not look like a batch script.  The first
sbatch: error: line must start with #! followed by the path to an interpreter.
sbatch: error: For instance: #!/bin/sh

The next five lines are prefixed by #SBATCH and are directives for Slurm. For example, the first of these lines

#SBATCH --job-name="My first job"

sets the name of the job to My first job. The next two lines request resources for the job.

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

We will not discuss the meaning of the --ntasks option right now; it will come up later when we discuss MPI. The important thing to understand for now is that we request a single CPU core with --cpus-per-task=1.
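
As an aside, if you later want to run a multithreaded (for example, OpenMP) program, the usual pattern is to keep a single task and raise --cpus-per-task; a minimal sketch:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8   # request 8 cores on one node for a single task

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # let the program use all requested cores

On the next line of our batch script, we specify the time limit for the job.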

#SBATCH --time=01:00

Here, we set the time limit to one minute. The format for this option is DD-HH:MM:SS. For example

  • 1-12:00:00 is requesting 1 day and 12 hours
  • 06:00:00 is requesting 6 hours
  • 15:00 is requesting 15 minutes

The --time option specifies an upper limit: if the job runs for longer than this limit, the scheduler will terminate it.

The last directive (--output) specifies the output file we want to use for this job.

#SBATCH --output=firstjob.out
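
A useful refinement, documented among the sbatch filename patterns, is to include %j in the file name; Slurm replaces it with the job ID, so the output of successive runs is not overwritten:

#SBATCH --output=firstjob-%j.out   # produces, e.g., firstjob-5971751.out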

The final two lines of our script are the commands we want to run on the compute node.

echo "Hello! I'm job with ID ${SLURM_JOBID}."
echo "I'm running on compute node(s) ${SLURM_JOB_NODELIST}."

The echo command prints text to the standard output (similar to printf in C). ${SLURM_JOBID} and ${SLURM_JOB_NODELIST} are variables set automatically by Slurm. If we look at the output produced by our job using the cat command, which prints the content of a file to the terminal, we get

 $ cat firstjob.out
Hello! I'm job with ID 5971751.
I'm running on compute node(s) nic5-w070.

Here, ${SLURM_JOBID} has been replaced by our job ID (5971751) and ${SLURM_JOB_NODELIST} by the name of the compute node on which the job ran (nic5-w070).
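
Slurm sets many other variables in the job environment. A few commonly useful ones (a non-exhaustive selection from the sbatch documentation):

echo "Tasks requested:   ${SLURM_NTASKS}"          # value of --ntasks
echo "CPUs per task:     ${SLURM_CPUS_PER_TASK}"   # value of --cpus-per-task
echo "Submit directory:  ${SLURM_SUBMIT_DIR}"      # directory where sbatch was invoked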

Inspecting the queue

It is very likely that the job we submitted in the previous section started almost immediately, as we only requested one core. It is also very short, as the only thing the script does is print some information about the job.

In order to artificially increase the duration of the job, we will add sleep 60 (wait for 60 seconds) at the end of the script and raise the time limit to two minutes

#!/bin/bash
#SBATCH --job-name="My first job"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00
#SBATCH --output=firstjob.out

echo "Hello! I'm job with ID ${SLURM_JOBID}."
echo "I'm running on compute node(s) ${SLURM_JOB_NODELIST}."

sleep 60

and submit it again

 $ sbatch firstjob_submit.sh
Submitted batch job 5971997

Now, we will inquire about the status of our job using the squeue command, which inspects the Slurm job queue. By default, this command prints information about all the jobs running or waiting in the queue. To only get information about your own jobs, you need to add the --me option.

 $ squeue --me
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5971997     batch My first  olouant PD       0:00      1 (Resources)

Looking at the ST column, we see that the job is in the PD state, which is an abbreviation for pending: the job is waiting for a compute node to become available. This is confirmed by the NODELIST(REASON) column, which indicates that the job is waiting for resources to become available. Another possible output might be

 $ squeue --me
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
5971997     batch My first  olouant  R       0:04      1 nic5-w070

Here, the job is in the R state (running) and has been running on node nic5-w070 for 4 seconds.

A last possibility is that your job has already completed, either because all the commands terminated successfully, or because of an error in your job script. In both cases, no job will be visible in the output of the squeue command:

 $ squeue --me
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
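
Once a job has left the queue, you can still query it with the sacct accounting command, provided job accounting is enabled on the cluster (it is on most Slurm installations):

sacct -j 5971997 --format=JobID,JobName,State,Elapsed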

Canceling a job

Sometimes, you realize you made a mistake in your job script and want to correct it. However, editing your job script while it is pending in the queue or already running has no effect on the submitted job. In order to correct your mistake, you need to cancel the job, fix the script and submit it again.

The Slurm command to cancel a job is scancel, followed by the ID of the job you wish to cancel. For example, to cancel the job submitted in the previous section, we can use the command

scancel 5971997

If you don't remember the ID of the job you want to cancel, you can always use the squeue --me command to retrieve it.
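
scancel can also select jobs by attribute instead of by ID (these are standard scancel filters). For example:

scancel -u $USER                         # cancel all of your jobs
scancel -u $USER --state=PENDING         # cancel only your pending jobs
scancel -u $USER --name="My first job"   # cancel your jobs with a given name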

Summary

Command                   Description
sinfo                     Gather information about the nodes and partitions
sbatch BATCH_SCRIPT_FILE  Submit the job defined in the file BATCH_SCRIPT_FILE to the queue
squeue --me               List your jobs in the queue
scancel JOBID             Cancel the job with job ID JOBID