Running Jobs on Raj with Slurm

Slurm (Simple Linux Utility for Resource Management) is the primary method by which work is scheduled and executed on Raj. Since Raj is a campus-wide resource with a finite number of CPUs, users are not allowed to run programs and launch simulations directly on compute nodes as they might on a private cluster. Jobs are submitted to a queue, where they wait for the requested resources to become available. Once the resources become available, the job is launched. For more information on how Slurm chooses which jobs to run next and on which nodes, see the Appendix.

Submitting and Monitoring a Job Using Slurm

Jobs are primarily launched by specifying a submission script to Slurm via the following command:

sbatch <your_submission_script>.slurm


For more on writing submission scripts, see the section Writing a Slurm Submission Script. Once your job has been submitted, you can follow its progress using the squeue command. squeue shows a listing of all currently queued jobs and their state. Common states include:

Code  State      Explanation
CA    Canceled   Job was explicitly canceled by the user or system administrator
CD    Completed  Job has terminated
F     Failed     Job terminated with a non-zero exit code
PD    Pending    Job is awaiting resource allocation
R     Running    Job has a resource allocation and is currently executing

If you wish to cancel a queued or running job, you can use the command scancel <jobid>. Finally, to see a list of available resources and their states, use the command sinfo. For a more comprehensive list of common Slurm commands, see the section Common Slurm Commands.
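As an example, a typical submit-monitor-cancel session might look like the following. The script name and the job ID 123456 are hypothetical; use the ID printed by sbatch or shown by squeue.

sbatch myjob.slurm     # prints something like: Submitted batch job 123456
squeue -u $USER        # list only your own pending and running jobs
scancel 123456         # cancel the job if it is no longer needed
sinfo                  # show partitions and the state of their nodes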

Writing a Slurm Submission Script

A submission script is essentially a bash script which describes the actions to be taken by the scheduler. The script can be broken down into three parts: the hashbang, the directives, and the commands. The hashbang line is the first line of any bash script and tells the computer how to interpret the rest of the script. The directives are Slurm-specific lines which tell the scheduler how many resources are being requested and for how long. Finally, the commands are the bash commands used to launch the job. A simple example of a submission script is shown below.

#!/bin/bash
#SBATCH --job-name="gaussianjob"
#SBATCH --partition=batch
#SBATCH --time=500:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=1
#SBATCH --mem=500GB
#SBATCH --output=%x-%j.log
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=my.email@marquette.edu

export SLURM_SUBMIT_DIR=/mmfs1/home/username/simulation/directory
cd $SLURM_SUBMIT_DIR
module load gaussian/g16
export GAUSS_SCRDIR=/local/$SLURM_JOB_NAME-$SLURM_JOBID
mkdir $GAUSS_SCRDIR
g16 gaussjob.com
rm -rf $GAUSS_SCRDIR

Line 1. This is the hashbang line which tells the system this is a bash script.

Line 2. This line begins the Slurm directives. The #SBATCH prefix marks the line as a Slurm directive (bash treats it as a comment, but the scheduler reads it). The argument --job-name="gaussianjob" sets the job name to gaussianjob.

Line 3. The directive #SBATCH --partition=batch sets this job to be queued into the default partition: the batch partition. Raj has three queues: the batch queue, the debug queue and the ai queue. The batch queue is where most jobs will be run and has no special properties. The debug queue restricts the amount of resources which can be requested but gives a small boost in priority. The ai queue gives you access to the AI/ML nodes. For a more detailed description of the queues, see the section on Understanding the Queue.

Line 4. This directive lets Slurm know the job will take no longer than 500 hours to complete. If the job is not complete at the end of 500 hours, Slurm will kill it to free up the resources. Therefore, it is always better to overestimate, rather than underestimate, walltime. However, overestimating by too much may cause your job to sit in the queue longer while the resource manager runs smaller, quicker jobs. For a better understanding of how Slurm chooses which jobs to run next and on which nodes, see the section on Understanding the Queue. The default value of the time directive is 1 hour, or 1:00:00.
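Slurm accepts walltime in several formats, including minutes, hours:minutes:seconds and days-hours:minutes:seconds. The lines below show alternative, purely illustrative ways of writing a time limit; only one --time directive should appear in any given script.

#SBATCH --time=30             # 30 minutes
#SBATCH --time=2:00:00        # 2 hours
#SBATCH --time=500:00:00      # 500 hours, as in the example above
#SBATCH --time=20-20:00:00    # days-hours:minutes:seconds; also 500 hours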

Lines 5-7. These directives are what request compute resources from the scheduler. By default, each task (specified by --ntasks) is assigned one CPU core. Depending on the resources available, these tasks can all be run on a single node, or they may be split between multiple nodes. If you are running multiple tasks and want to restrict them to a smaller subset of nodes, the #SBATCH --nodes=N directive can be issued, where N is the number of nodes requested. The tasks will then be assigned to the nodes in a round-robin fashion. In this case, we are requesting that all tasks run on a single node. Additionally, if you want more than one core per task, the #SBATCH --cpus-per-task directive can be issued. Note that for this job there is only one CPU assigned per task, so 128 CPUs will be requested, but if --cpus-per-task were set to 2, then 256 cores would be requested (two for each task). The default values for --nodes, --ntasks and --cpus-per-task are all 1.
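For example, a hypothetical hybrid MPI/OpenMP job that needs 4 tasks with 8 cores each, spread across 2 nodes, could request its resources as follows (the values are purely illustrative):

#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8     # 4 tasks x 8 cores each = 32 cores in total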

Line 8. On the Raj system, Slurm applies the directive --mem-per-cpu=4GB by default. This default can be overridden in one of two ways: either by issuing your own #SBATCH --mem-per-cpu=<mem> directive, or by setting the total amount of memory requested for the entire job, regardless of the number of cores requested, with the #SBATCH --mem=<mem> directive (as is done in this line).
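For instance, the 500GB total requested above could instead be expressed per core. With 128 single-CPU tasks, the two (mutually exclusive) forms below request a comparable amount of memory; the values are illustrative.

#SBATCH --mem=500GB           # total memory for the whole job
#SBATCH --mem-per-cpu=4GB     # memory per allocated core (128 x 4GB = 512GB here)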

Line 9. This directive redirects the terminal output to the specified file (in this case a file named %x-%j.log). %x is a file pattern which expands to the job name, and %j is a file pattern which expands to the job ID. So if this job were running with a job ID of 2, the output file would be named gaussianjob-2.log. The output file could be named something simple like output.log or output.txt; however, each successive run would then overwrite the previous log file. File patterns are a good way to keep records, especially if a single directory contains multiple runs of multiple types of jobs. By default, Slurm redirects both STDOUT and STDERR to this file. However, if you would like a separate file for STDERR, you can specify one with the Slurm directive #SBATCH --error=<filename>.
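For example, to keep STDOUT and STDERR in separate, per-job files, the two directives could be combined as follows:

#SBATCH --output=%x-%j.out    # STDOUT, e.g. gaussianjob-2.out
#SBATCH --error=%x-%j.err     # STDERR, e.g. gaussianjob-2.err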

Lines 10-11. These directives tell Slurm to send an email to my.email@marquette.edu when the job begins, when the job finishes running and if the job fails. Other valid events are NONE, REQUEUE (job was re-queued), TIME_LIMIT (job reached its maximum walltime), TIME_LIMIT_90, TIME_LIMIT_80 and TIME_LIMIT_50 (job has been running for 90%, 80% and 50% of its allotted walltime, respectively) and ARRAY_TASKS (emails the user after completion of each task in a job array). See the section Running Job Arrays for more details on job arrays. The default value of --mail-type is NONE.

Lines 13-14. All jobs run in the directory specified by the environment variable $SLURM_SUBMIT_DIR. By default, as part of the job's start-up process, Slurm sets SLURM_SUBMIT_DIR to the directory from which the job was submitted (i.e. SLURM_SUBMIT_DIR=$PWD). However, for organizational purposes, some users do not keep their submission scripts in the same directory as their job files. These lines show how to override Slurm's default working directory.

Lines 15-19. These lines set up and execute an example job, in this case a Gaussian job: they load the Gaussian module, create a scratch directory on the node's local disk, run the calculation and remove the scratch directory when the job finishes.

Running MPI Jobs

To run an MPI job, the following template can be used as a starting point for your own submission script.

#!/bin/bash

#SBATCH --job-name="mpi"
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --output=%x-%j.log

module load mpich/ge/gcc/64/3.3.2

HOSTFILE=$SLURM_SUBMIT_DIR/hostfile-$SLURM_JOBID
scontrol show hostnames > $HOSTFILE

mpirun -np 64 -hostfile $HOSTFILE ./myprog param1 param2

rm $HOSTFILE
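The hostfile created by scontrol show hostnames simply contains one allocated hostname per line. For example, for a job spanning two nodes it might look like this (the node names are hypothetical):

# contents of hostfile-<jobid> for a hypothetical two-node allocation
cnode001
cnode002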

In this template, we are requesting 64 cores. To achieve this, we specify --ntasks=64. Next, we need to let MPICH know which hosts the program will execute on. First we define a variable pointing to our hostfile. In this example we use the variable $HOSTFILE, and we define it as a file called hostfile-$SLURM_JOBID located in the $SLURM_SUBMIT_DIR. This ensures that each job creates a unique hostfile, so two jobs running in the same directory cannot accidentally overwrite one another's hostfiles. Next we populate the hostfile with the hosts assigned to this job using the command scontrol show hostnames > $HOSTFILE. We then ask for 64 processors with our mpirun command using the -np 64 option, and pass the hostfile to mpirun using the option -hostfile $HOSTFILE. Finally we invoke our executable (in this case a program called myprog) and pass along any command line arguments (in this example two dummy parameters, param1 and param2). After the MPI process has finished, we clean up our workspace by deleting the hostfile.

Note that this submission script omits many of the Slurm directives shown in the example script in the previous section. All directives that are not specified will be filled in with their default values. Also note that in this example we are using MPICH as our flavor of MPI. If we wanted to use OpenMPI, the following template could be used.

#!/bin/bash

#SBATCH --job-name="mpi"
#SBATCH --partition=batch
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --output=%x-%j.log

module load openmpi-geib-cuda10.2-gcc/4.0.5

mpirun myprog param1 param2

OpenMPI and Slurm are designed to be compatible when OpenMPI is configured with Slurm support at compile time. The version of OpenMPI installed on Raj is compiled with the Slurm libraries, as well as with the Gigabit Ethernet, Mellanox InfiniBand and CUDA drivers/libraries. Thus, when we invoke our executable, we do not need to specify the number of cores or the hostfile; these are automatically passed to mpirun by Slurm.
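A minimal end-to-end workflow for the OpenMPI case might therefore look like the following, assuming the source file is myprog.c and the submission script above is saved as mpi_job.slurm (both names are placeholders).

module load openmpi-geib-cuda10.2-gcc/4.0.5   # same module loaded in the script
mpicc -O2 -o myprog myprog.c                  # compile with the OpenMPI wrapper compiler
sbatch mpi_job.slurm                          # submit the job to the scheduler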

Running Job Arrays

Sometimes you may want to run a large number of similar jobs. For example, you may want to run multiple jobs which process different input data using the same program or run the same program multiple times using different parameters passed to it through the command line (for instance when doing a bootstrap analysis or Monte Carlo simulation). For these types of jobs, an array job can be used. The following submission script runs 10 instances of the same program while passing the argument 1 to the first instance, the argument 2 to the second instance, etc.

#!/bin/bash

#SBATCH --job-name="array"
#SBATCH --partition=batch
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=%x-%j.log
#SBATCH --array=1-10

srun myprog $SLURM_ARRAY_TASK_ID

Note that if you want to limit the number of array tasks which can run simultaneously, you can set --array=1-10%n, where n is the maximum number of tasks allowed to run at the same time.
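For example, to run the ten tasks above while allowing no more than two of them to execute at once, the directive would become:

#SBATCH --array=1-10%2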

Requesting GPUs

Raj has several nodes which contain NVIDIA V100 GPUs (see the section System Architecture). To make use of these GPUs, they need to be requested using the Generic Resource (GRES) directive. An example submission script is shown below.

#!/bin/bash

#SBATCH --job-name="cuda"
#SBATCH --partition=batch
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --gres=gpu:1
#SBATCH --output=%x-%j.log

module load cuda/toolkit/10.2.89

./mycudaprog

Here we are requesting 64 cores and 1 GPU. When requesting a GPU, two additional directives can be used: --cpus-per-gpu and --mem-per-gpu. These directives make it easier to scale up the number of GPUs being used, since the CPU and memory requests grow automatically with the GPU count.
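For example, a hypothetical job that scales up to two GPUs could let Slurm size the CPU and memory request per GPU rather than per job (the values below are illustrative, not recommendations):

#SBATCH --gres=gpu:2
#SBATCH --cpus-per-gpu=16     # 2 GPUs x 16 cores = 32 cores in total
#SBATCH --mem-per-gpu=64GB    # 2 GPUs x 64GB = 128GB in total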

Requesting an AI/ML node

To request an AI/ML node, add the directive --partition=ai. A template for requesting an AI/ML node is included below.

#!/bin/bash

#SBATCH --job-name="cuda"
#SBATCH --partition=ai
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=36
#SBATCH --gres=gpu:8
#SBATCH --output=%x-%j.log

module load cuda/toolkit/10.2.89

./mycudaprog

The AI/ML nodes use Intel CPUs rather than AMD CPUs. Thus, if you are migrating a job from the GPU compute nodes to the AI/ML nodes, make sure to update the --ntasks and --mem directives to reflect the different node specifications.

Initiating an Interactive Job

There may be times when a particular job does not script well but still needs to leverage parallel processing. In this case, it is beneficial to run an interactive job. This can be done in one of two ways. The first is to issue the command salloc. This command requests resources from the scheduler and opens a shell to give access to the allocated resources. The options for salloc are identical to the directives which can be issued in a submission script (e.g. --time, --partition, --ntasks, etc.); if these options are not specified, they are automatically filled in with their default values. To request an interactive job with 12 cores, issue the command:

salloc --nodes=1 --ntasks=12

Note that all commands run in this shell are, by default, executed on the login node. To run a command on the allocated resources, prefix the command with srun. For example, say you had a Python script you were troubleshooting. Instead of writing a submission script, editing the code, submitting the batch script and then reading the log file, you could run an interactive job. Using the example salloc command shown above, you secure a 12-core allocation. You could then edit the code on the login node and run a test job on your allocated resources using srun.

vim myprog.py #edit script
srun python myprog.py # run script on allocated resources

This also works for custom C code written with MPI. Since OpenMPI is compiled with Slurm support, and MPICH can be given the allocated hosts explicitly, the srun prefix can be skipped and mpirun used directly. A sample interactive workflow is shown below, where the editing and compiling are done on the login node and the program is run on the allocated resources.

# For OpenMPI
module load openmpi-geib-cuda10.2-gcc/4.0.5 #load module
vim myprog.c #edit code
mpicc -o myprog myprog.c #compile
mpirun myprog #run

# For mpich
# First change from openmpi to mpich
module switch openmpi-geib-cuda10.2-gcc/4.0.5 mpich/ge/gcc/64/3.3.2
vim myprog.c #edit code
mpicc -o myprog myprog.c #recompile
mpirun -np 64 -hosts $SLURM_JOB_NODELIST ./myprog #run

To exit the job and relinquish the allocation, simply exit from the shell using the exit command or the ^D shortcut.

Running Interactive Jobs with a Graphical Interface

If you want to run an interactive job with a graphical user interface (GUI), you will need to initiate the job using the --x11 flag. A sample workflow is shown below.

salloc --x11 --nodes=1 --ntasks=64 --time=5:00:00
srun myprog

To exit the job and relinquish the allocation, exit the program then exit the shell with the exit command or ^D.

Running the MATLAB GUI Interactively

When running MATLAB interactively, make sure to specify --ntasks=1 and --cpus-per-task=n, where n is the number of processors requested. Specifying --ntasks=n instead will result in Slurm opening n instances of MATLAB. Additionally, the --pty (pseudo terminal) flag needs to be added to the srun command. A MATLAB workflow would look something like this:

salloc --x11 --nodes=1 --ntasks=1 --cpus-per-task=64 --time=5:00:00
module load matlab/R2020a
srun --pty matlab