Slurm
Introduction
A queue system is installed on our compute servers and on the machines of some workgroups to manage the resources used by computing tasks. The software used is Slurm. The following describes how to use it.
A queue system allows computationally intensive jobs to be queued and executed as soon as enough resources become available.
Each compute server is a node on which so-called jobs, i.e. one or more programs, are executed. A job can also run in parallel on several nodes. Each node is essentially a resource consisting of a number of CPU cores and a certain amount of RAM.
To run a job on one or more nodes, it is sufficient to log in to one of the machines involved via ssh.
Commands (selection)
Slurm provides a variety of commands, of which the following should be the most useful for most users:
Information about nodes:
sinfo -N -l
lists the nodes and their status. Here you can also directly see the different types of computing nodes and their availability.
$ sinfo -l
PARTITION    AVAIL  TIMELIMIT   JOB_SIZE  ROOT  OVERSUBS  GROUPS  NODES  STATE  NODELIST
interactive  up     2:00:00     1-2       no    NO        all     57     idle   adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,gruenau[1-2,5-10],guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
std*         up     4-00:00:00  1-16      no    NO        all     57     idle   adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,gruenau[1-2,5-10],guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
gpu          up     4-00:00:00  1-16      no    NO        all     37     idle   adlershof,alex,bernau,britz,buch,buckow,dahlem,erkner,forst,frankfurt,gatow,gruenau[1-2,9-10],guben,karow,kudamm,lankwitz,marzahn,mitte,nauen,pankow,potsdam,prenzlau,rudow,seelow,spandau,staaken,steglitz,tegel,templin,treptow,wandlitz,wannsee,wedding,wildau
gruenau      up     5-00:00:00  1-2       no    NO        all     8      idle   gruenau[1-2,5-10]
pool         up     4-00:00:00  1-16      no    NO        all     49     idle   adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
scontrol show node [NODENAME]
shows a very detailed overview of all nodes or of a single node. Here you can see all the features a node offers, as well as its current load.
$ scontrol show node adlershof
NodeName=adlershof Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUTot=8 CPULoad=0.00
AvailableFeatures=intel,avx2,skylake
ActiveFeatures=intel,avx2,skylake
...
Information about submitting jobs:
sbatch JOBSCRIPT
queues a job script.
srun PARAMETER
runs a job ad hoc with the given parameters. This should only be seen as an alternative to sbatch or as a test command. Examples of srun and sbatch commands can be found further down the page.
Information about running jobs:
squeue
shows the contents of the queues.
scontrol show job JOBNUMBER
displays information about a specific job.
scancel JOBNUMBER
cancels a specific job.
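For illustration, a minimal monitoring session might look like this; the job ID 12345 and the script name my_job.sh are placeholders:

$ sbatch my_job.sh
Submitted batch job 12345
$ squeue --jobs=12345       # is the job still pending or already running?
$ scontrol show job 12345   # detailed view of the requested resources and assigned nodes
$ scancel 12345             # abort the job if it is no longer needed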
More useful commands and parameters can be found in the Slurm Cheat Sheet.
Partitions
Depending on the requirements of the program, different queues (called partitions by Slurm) are available. Here is an overview:
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
interactive  up     2:00:00     57     idle   adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,gruenau[1-2,5-10],guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
std*         up     4-00:00:00  57     idle   adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,gruenau[1-2,5-10],guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
gpu          up     4-00:00:00  37     idle   adlershof,alex,bernau,britz,buch,buckow,dahlem,erkner,forst,frankfurt,gatow,gruenau[1-2,9-10],guben,karow,kudamm,lankwitz,marzahn,mitte,nauen,pankow,potsdam,prenzlau,rudow,seelow,spandau,staaken,steglitz,tegel,templin,treptow,wandlitz,wannsee,wedding,wildau
gruenau      up     5-00:00:00  8      idle   gruenau[1-2,5-10]
pool         up     4-00:00:00  49     idle   adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
The default queue over all available machines is called std. Whenever no partition is explicitly named in a job description, std is automatically selected.
Furthermore, there is an interactive queue, which is limited both in runtime and in the number of nodes that can be used simultaneously. This queue has a higher priority when jobs are scheduled and is therefore suitable for test runs or configuration tasks.
$ srun --partition=interactive -n 1 --pty bash -i
The queues can also be filtered by specifying certain resources (such as AVX512, GPU, ...) as a condition. In the following example, a node with a GPU is requested:
$ srun -n 1 --gres=gpu:1 ...
Description of the partitions:
- std: default partition. Used if no partition is specified in the script. All nodes are contained here.
- interactive: partition for testing jobs. Only interactive jobs (using srun with matching parameters) are allowed here. The maximum allowed time is 2 h.
- gpu: GPU partition. All nodes have at least 1 GPU. To actually request a GPU, it must be specified using gres; the request can be restricted to a model, driver, memory size or number of GPUs. Jobs here have a higher priority for GPU programs.
- gruenau: gruenau partition. All gruenau machines are included here, otherwise analogous to std. A maximum of two gruenaus can be allocated at the same time.
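For illustration, a job can be directed to one of these partitions with the --partition parameter; the sketch below uses placeholder values, and gpu:1 simply requests one GPU of any model:

# Run on the gpu partition and request one GPU of any type
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

# Alternatively: run on the gruenau partition
#SBATCH --partition=gruenau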
Use of sbatch
Using sbatch, predefined job scripts can be submitted. The entire configuration of the job, as well as the commands to be executed, is described exclusively in the script itself. Typically, job scripts are written as shell scripts, so the first line should look like this:
#!/bin/bash
The configuration parameters follow on the next lines; conditions and dependencies can also be defined here. Each configuration line starts with the keyword #SBATCH. Example:
### Let Slurm allocate 4 nodes
#SBATCH --nodes=4
squeue --jobs=<ID>
returns more information about the job while it is in the queue. If no output parameter is specified, Slurm creates an output file slurm-<JOB-ID>.out in the folder the job was submitted from.
If a combination of resource requests is not supported by the selected partition, Slurm returns an appropriate error message when running sbatch.
Important parameters:
Parameter | Function
--job-name=<name> | Job name. If none is given, Slurm generates one.
--output=<path> | Output path for both results and errors. If none is given, both outputs are written to the folder of execution.
--time=<runlimit> | Runtime limit in hours:min:sec. When the limit is exceeded, the job is automatically killed.
--mem=<memlimit> | Main memory allocated per node.
--nodes=<# of nodes> | Number of nodes to be allocated.
--partition=<partition> | Sets the partition on which the job should run. If none is given, the default partition is used.
--gres=<gres> | Used for allocating hardware resources like GPUs.

Parallel Programming (OpenMP)
--cpus-per-task=<num_threads> | Number of threads per task. If, for example, a node has 4 cores (without hyperthreads) and all of them should be used, the parameter should be set to --cpus-per-task=4.
--ntasks-per-core=<num_hyperthreads> | Number of hyperthreads per CPU core. Values >1 enable hyperthreading where available (not every CPU in the pool supports HT).
--ntasks-per-node=1 | Recommended setting for OpenMP (without MPI).

Parallel Programming (MPI)
--ntasks-per-node=<num_procs> | Number of tasks per node. For MPI-only programs this should equal the number of CPU cores (without hyperthreading).
--ntasks-per-core=1 | Recommended setting for MPI (without OpenMP).
--cpus-per-task=1 | Recommended setting for MPI (without OpenMP).
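As a sketch, several of the parameters above can be combined in one job-script header; the job name, output file and program below are placeholders (%j is Slurm's filename pattern for the job ID):

#!/bin/bash
# Job name shown in squeue
#SBATCH --job-name=my-analysis
# Write results and errors to this file; %j is replaced by the job ID
#SBATCH --output=result-%j.out
# Kill the job after two hours at the latest
#SBATCH --time=02:00:00
# Main memory per node
#SBATCH --mem=8G
# Number of nodes
#SBATCH --nodes=1
# Partition (std is the default and could be omitted)
#SBATCH --partition=std

./my_program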
Example scripts (sbatch)
Hello World (Print hostname)
hello_v1.sh: Four nodes return their hostnames once each.
#!/bin/bash

# Job name
#SBATCH --job-name=hello-slurm
# Number of Nodes
#SBATCH --nodes=4
# Number of processes per Node
#SBATCH --ntasks-per-node=1
# Number of CPU-cores per task
#SBATCH --cpus-per-task=1

srun hostname
Output:
adlershof
alex
britz
buch
hello_v2.sh: Two nodes return their hostnames twice each.
#!/bin/bash

# Job name
#SBATCH --job-name=hello-slurm
# Number of Nodes
#SBATCH --nodes=2
# Number of processes per Node
#SBATCH --ntasks-per-node=2
# Number of CPU-cores per task
#SBATCH --cpus-per-task=1

srun hostname
Output:
adlershof
adlershof
alex
alex
Parallel Programming (OpenMP)
openmp.sh: Here a program with four threads is executed on one node. To do this, the number of requested CPU cores (without hyperthreads) is first set to four. This number is then passed on to OpenMP.
#!/bin/bash

# Job name
#SBATCH --job-name=openmp-slurm
# Number of Nodes
#SBATCH --nodes=1
# Number of processes per Node
#SBATCH --ntasks-per-node=1
# Number of CPU-cores per task
#SBATCH --cpus-per-task=4
# Disable Hyperthreads
#SBATCH --ntasks-per-core=1

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./my_openmp_program
Parallel Programming (MPI)
hello_mpi.sh: Similar to the Hello World example, all nodes involved produce output here. The code for this example can be found here. Communication and synchronization are done via MPI. srun offers several protocols for transferring MPI data. You can get a list of the supported protocols with the following command:
$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmix
srun: pmi2
In addition to the protocol an MPI implementation (see section mpi-selector) must be selected. Not all MPI implementations support every transmission protocol. A good overview of available combinations and best practices can be found here.
The following script starts four processes on each of 2 nodes, which communicate with each other via pmix_v3. The code was previously compiled using OpenMPI 4: mpic++ mpi_hello.cpp -o mpi_hello
#!/bin/bash

# Job Name
#SBATCH --job-name=mpi-hello
# Number of Nodes
#SBATCH --nodes=2
# Number of processes per Node
#SBATCH --ntasks-per-node=4
# Number of CPU-cores per task
#SBATCH --cpus-per-task=1

module load gnu-openmpi/4.0.5

# Compiled with OpenMPI 4
srun --mpi=pmix_v3 mpi_hello
Output:
Hello world from processor gatow, rank 2 out of 8 processors
Hello world from processor gatow, rank 3 out of 8 processors
Hello world from processor gatow, rank 0 out of 8 processors
Hello world from processor gatow, rank 1 out of 8 processors
Hello world from processor karow, rank 4 out of 8 processors
Hello world from processor karow, rank 5 out of 8 processors
Hello world from processor karow, rank 6 out of 8 processors
Hello world from processor karow, rank 7 out of 8 processors
Mixed Parallel Programming (OpenMP + MPI)
hello_hybrid.sh: There is also the possibility to combine OpenMP and MPI. Each started MPI process can then start multiple threads on multiple CPU cores. The code for this example can be found here. The following Slurm script starts four processes on 2 nodes, which start 2 threads each. The code was previously compiled using OpenMPI 4: mpic++ -fopenmp hybrid_hello.cpp -o hybrid_hello
#!/bin/bash

# Job Name
#SBATCH --job-name=hybrid-hello
# Number of Nodes
#SBATCH --nodes=2
# Number of processes per Node
#SBATCH --ntasks-per-node=2
# Number of tasks in total
#SBATCH --ntasks=4
# Number of CPU-cores per task
#SBATCH --cpus-per-task=2

module load gnu-openmpi/4.0.5

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Compiled with OpenMPI 4
srun --mpi=pmix_v3 hybrid_hello
Output:
Hello from thread 0 of 2 in rank 0 of 4 on gatow
Hello from thread 1 of 2 in rank 0 of 4 on gatow
Hello from thread 1 of 2 in rank 1 of 4 on gatow
Hello from thread 0 of 2 in rank 1 of 4 on gatow
Hello from thread 1 of 2 in rank 2 of 4 on karow
Hello from thread 0 of 2 in rank 2 of 4 on karow
Hello from thread 1 of 2 in rank 3 of 4 on karow
Hello from thread 0 of 2 in rank 3 of 4 on karow
GPU Programming (Tensorflow)
tensorflow_gpu.sh: To be able to use at least one GPU, a suitable resource must be requested in Slurm using gres. Requests can be generic, such as a certain minimum number of GPU cards (--gres=gpu:2) or a certain CUDA compute capability (--feature=cu80). Alternatively, a specific GPU model can be requested. The code for this example can be found here. An overview of all available models and their gres designations or multiplicity can be found on the GPU server overview page.
#!/bin/bash

# Job Name
#SBATCH --job-name=tensorflow-gpu
# Number of Nodes
#SBATCH --nodes=1
# Set the GPU-Partition (opt. but recommended)
#SBATCH --partition=gpu
# Allocate node with certain GPU
#SBATCH --gres=gpu:gtx745

module load cuda

python mnist_classify.py
Output (trunc.):
2021-03-29 12:20:10.976419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3591 MB memory) -> physical GPU (device: 0, name: GeForce GTX 745, pci bus id: 0000:01:00.0, compute capability: 5.0)
...
Train on 60000 samples
Epoch 1/10
...
10000/10000 - 0s - loss: 1.4880 - acc: 0.9741
('\nTest accuracy:', 0.9741)
Folder structure
The path in which both the job script and the program to be executed are located must be available on all nodes under the same path. The program must therefore either be installed system-wide or be part of a global file system (see the file server overview). For Slurm tasks with specially compiled programs and larger input/output files, /vol/tmp is recommended. This folder is available on all nodes. For a better overview, a subfolder named after your own user name should be created. By default, this folder is only accessible to you. Please keep an eye on your disk space consumption on this shared resource.
cd /glusterfs/dfs-gfs-dist
mkdir brandtfa
ls -la
> drwx------ 2 brandtfa maks 4096 18. Mär 11:29 brandtfa
Your own HOME directory can be used for saving data (especially the results of calculations). It can also hold smaller programs and data sets used for the calculation itself. Please note the size limitation here as well.
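As a sketch, the last lines of a job script can copy results from the working directory back to the HOME directory; the file names and the results subfolder are placeholders:

# At the end of the job script: copy results from the working directory to HOME
mkdir -p "$HOME/results/$SLURM_JOB_ID"
cp result-*.dat "$HOME/results/$SLURM_JOB_ID/"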
MPI
In the case of multi-process programs that are to run either on one or on several nodes, MPI is used as the communication standard. Here several implementations are installed on all nodes:
- openmpi
- openmpi2
- openmpi4
- mpich
- impi (Intel MPI)
Each of the implementations provides its own headers, libraries and binaries. Programs that are to use MPI must be compiled with the compilers of the respective MPI environment. The peculiarity here: after logging in, none of the implementations is initially available in the PATH; one of them must be activated first.
The recommended way to do this is to use modules:
$ module load gnu-openmpi/4.0.5
$ mpirun --version
mpirun (Open MPI) 4.0.5.0.88d8972a4085

Report bugs to http://www.open-mpi.org/community/help/
Modules can also be used in Slurm scripts. For this, the module load commands should be called before the other commands (see the example scripts for MPI or GPU programming).
An alternative is to use mpi-selector for activation, but modules should be used whenever possible.
# Get a list of all installed MPI versions
$ mpi-selector --list
mpich
openmpi
openmpi2
openmpi4

$ mpi-selector --set openmpi4
$ mpi-selector --query
default:openmpi4
level:user
Note: The MPI environment is only available after logging in again.
$ mpirun --version
mpirun (Open MPI) 4.0.5.0.88d8972a4085

Report bugs to http://www.open-mpi.org/community/help/
Best Practices
Use Slurm where possible!
If a program consumes more time and resources and the result of the calculation is not time-critical, it should be executed using the queue system.
Partitions are here to help
While in principle all jobs can be processed on the standard std partition, it is recommended to select the appropriate partition for special requirements. Jobs on the special partitions such as gpu or gruenau have a higher priority, which means they are scheduled first when several jobs compete for a node.
Only allocate what you need
Slurm does not check whether a program really uses only the specified number of cores. However, it is in your interest and that of others to specify correct values and to limit your program to them. Slurm tries to utilize the resources as well as possible: if two jobs each request 16 cores, they can run simultaneously on a node with 32 cores. If the specification is wrong and a program uses more cores, the resources can no longer be distributed optimally and both jobs on the node need more time.
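One way to stay within the allocation is to derive the number of threads from the values requested from Slurm instead of hard-coding it; a minimal sketch (my_program is a placeholder):

# Request exactly the cores the program will use ...
#SBATCH --cpus-per-task=16

# ... and limit the program to that number of threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_program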
Set limits
By means of the --time parameter you can specify when your program will be terminated at the latest. Use it to prevent a program that does not terminate because of an error from blocking nodes. Note: the interactive partition sets an automatic time limit of 2 h.
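For example, a job that should never run longer than twelve hours would set:

# Terminate the job after 12 hours at the latest
#SBATCH --time=12:00:00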
Back up your data
It can always happen that a job terminates unintentionally. This can happen because the maximum runtime (time) expires, because there is a bug in the program, or because there is an error on the machine. Therefore, if possible, save intermediate results regularly in order to be able to restart the calculation from this point.
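One possible pattern, as a sketch only: Slurm can send a warning signal shortly before the time limit, which the job script can use to write a final checkpoint. The checkpoint path and the assumption that the program keeps its state in state.tmp are hypothetical:

# Send SIGUSR1 to the batch script 5 minutes before the time limit
#SBATCH --signal=B:USR1@300

# Write a last checkpoint when the warning signal arrives
trap 'cp state.tmp "$HOME/checkpoints/state-$SLURM_JOB_ID"' USR1

# Run the program in the background and wait, so that the trap can fire
./my_program &
wait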
Keep everything clean
Data that is no longer needed after the calculation should be deleted at the end of the script.