Job Submission

Running calculations on Hydra requires submitting a job to the job queue. Hydra uses Moab as the job scheduler and TORQUE as the resource manager.

VSC Docs: Running Jobs has an extensive introduction to job scripting, submission and monitoring, as well as links to more advanced topics. The sections below assume that you are already familiar with the basics of running jobs on the HPC cluster.

Hydra Submit Filter

In Hydra, before a job goes into the queue, a submit filter checks the job script for missing options and errors. The following options are added if not specified by the user:

  • send email when job aborts

  • request 4 GB of memory per core

  • assign jobs requesting 1 node and more than 245 GB to the high-memory node

  • assign jobs requesting 1 or more GPUs to a GPU node

The job submission will be aborted if the requested resources fulfill any of the following conditions:

  • requested RAM memory per core is less than 1 GB

  • requested RAM memory per node is more than total RAM memory

  • requested number of cores per node is higher than total number of cores of the node

  • requested number of cores is less than requested number of GPUs (must request at least 1 core per GPU)

  • requested job queue does not match requested resources

  • features do not exist, do not match requested resources, or are mutually incompatible
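As an illustration, the filter defaults listed above can be overridden by setting the corresponding options explicitly in the job script preamble; the values below are placeholders, not recommendations:

#!/bin/bash
#PBS -m ae                  # send email when the job aborts or ends
#PBS -l nodes=1:ppn=4       # 4 cores on a single node
#PBS -l pmem=2gb            # 2 GB of memory per core instead of the default 4 GB
#PBS -l feature=skylake     # request the skylake CPU architecture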

Job Restart and Checkpointing

The ability to restart a job that cannot finish before the maximum walltime depends on the capability of the software to do so. Scientific software commonly provides some mechanism to run for a given amount of time, stop in a controlled manner saving intermediary data files to disk, and restart the simulation from the last calculated step. Such methods allow splitting long jobs into restartable shorter chunks, which can be submitted one after the other. Check the documentation of your software for restarting options.
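For example, assuming chunk.sh is a restartable job script (a placeholder name), the chunks can be chained with Torque job dependencies so that each one starts only after the previous one finished successfully:

# submit the first chunk and capture its job ID
JOB1=$(qsub chunk.sh)
# submit the second chunk, held until the first one completes successfully
JOB2=$(qsub -W depend=afterok:$JOB1 chunk.sh)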

If the software does not support any restarting method, there are several options:

  • Increase the parallelization of your job

    Check if you get any speed-up by requesting more cores. Scientific software commonly supports running on multiple cores in the same node (e.g. -l nodes=1:ppn=4). Check also if the software supports multi-node jobs (usually with MPI).

  • Increase the efficiency of your job

    Check for I/O bottlenecks during execution. Is your job intensively swapping data between disk and RAM memory? In that case increasing the memory of the job might improve performance, as disk access is much slower than memory access.

  • Divide your job into parts

    Even if the software does not provide any method to restart the calculation, in many cases you can divide your job into smaller parts that can be recombined afterwards. If the parts depend on each other, save to disk any data needed by subsequent parts and submit each part sequentially in its own job.

  • Use faster CPUs

    You can specify that your job must run on the fastest node available using a feature (e.g. -l feature=skylake).

  • Use GPUs

    Check if the software supports running your job on a GPU (or even 2 GPUs), and of course also check if your job gets a speed-up from running on GPU(s).

Helpdesk

We can help you with issues related to checkpointing/restarting.

Torque/Moab to Slurm transition

Note

We will soon migrate Hydra from Torque/Moab to Slurm. This section is meant for experienced Torque/Moab users to quickly get up and running with Slurm.

The tables below provide a quick reference with translations from Torque/Moab into Slurm that can help you with the migration. Extra explanation is given in the sections below the tables.

Submitting and monitoring jobs

Replace <JOB_ID> with the ID of your job.

| Torque/Moab         | Slurm                                                | Description                                    |
|---------------------|------------------------------------------------------|------------------------------------------------|
| qsub job.sh         | sbatch job.sh                                        | Submit a job with batch script job.sh          |
| qsub [resources] -I | srun [resources] --pty bash -l                       | Start an interactive job, see Interactive jobs |
| qdel <JOB_ID>       | scancel <JOB_ID>                                     | Delete a job                                   |
| qstat               | squeue --states=all or sacct --starttime=YYYY-MM-DD  | Show job queue status                          |
| qstat -f <JOB_ID>   | scontrol show job <JOB_ID>                           | Show details about a job                       |
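As a minimal illustration of the corresponding Slurm workflow (job ID and script name are placeholders):

sbatch job.sh                 # submit the batch script; Slurm prints the assigned job ID
squeue --states=all           # show the status of your jobs in the queue
scontrol show job <JOB_ID>    # inspect the details of a specific job
scancel <JOB_ID>              # cancel the job if no longer needed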

Requesting resources and other options

| Torque/Moab          | Slurm                                                                         | Description                                                                  |
|----------------------|-------------------------------------------------------------------------------|------------------------------------------------------------------------------|
| -N job_name          | --job-name=job_name                                                           | Set job name to job_name                                                     |
| -l walltime=HH:MM:SS | --time=DD-HH:MM:SS                                                            | Request maximum walltime                                                     |
| -l nodes=1:ppn=1     | --ntasks=1                                                                    | Request a single CPU core                                                    |
| -l nodes=1:ppn=X     | --ntasks=1 --cpus-per-task=X                                                  | Request multiple cores on 1 node for Shared memory                           |
| -l nodes=X:ppn=Y     | --ntasks=(X*Y) or --ntasks=(X*Y) --nodes=X                                    | Request multiple cores on 1 or multiple nodes for Distributed memory         |
| -l pmem=N            | --mem-per-cpu=N (default unit = MB)                                           | Request amount of memory per CPU core only if needed, see Memory allocation  |
| -l feature=skylake   | --partition=skylake                                                           | Request skylake CPU architecture, see Slurm partitions                       |
| -l feature=pascal    | --partition=pascal_gpu                                                        | Request pascal GPU architecture, see Slurm partitions                        |
| -M email@example.com | --mail-user=email@example.com                                                 | Send job alerts to given email address                                       |
| -m <a\|b\|e>         | --mail-type=<BEGIN\|END\|FAIL\|REQUEUE\|ALL> (select 1 or comma separated list) | Conditions for sending alerts by email                                     |
| -o out_file          | --output out_file                                                             | Write stdout to out_file                                                     |
| -e err_file          | --error err_file                                                              | Write stderr to err_file                                                     |
| -j oe                | (default, unless --error is specified)                                        | Write stdout and stderr to the same file                                     |
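For illustration, several of these options can be combined on the sbatch command line as well as in the job script; the values below are placeholders:

sbatch --job-name=myjob --time=02:00:00 --ntasks=1 --cpus-per-task=4 \
       --mem-per-cpu=2G --mail-user=email@example.com --mail-type=END,FAIL job.sh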

Environment variables defined by resource managers

| Torque/Moab                | Slurm                                                            | Description                            |
|----------------------------|------------------------------------------------------------------|----------------------------------------|
| $PBS_JOBID                 | $SLURM_JOB_ID                                                    | Job ID                                 |
| $PBS_O_WORKDIR             | $SLURM_SUBMIT_DIR                                                | Directory where job was submitted from |
| $PBS_NODEFILE (nodes file) | $SLURM_JOB_NODELIST (nodes string) or $(scontrol show hostnames) | List of nodes assigned to job          |
| $PBS_JOBNAME               | $SLURM_JOB_NAME                                                  | Job name                               |
| $PBS_ARRAYID               | $SLURM_ARRAY_TASK_ID                                             | Job array ID (index) number            |
| $PBS_NUM_NODES             | $SLURM_JOB_NUM_NODES                                             | Number of nodes                        |
| $PBS_NUM_PPN               | see Shared memory and Distributed memory                         | Number of cores per node               |
| $PBS_NP                    | see Shared memory and Distributed memory                         | Total number of cores                  |

Features - partitions

See Slurm partitions for more info.

| Torque/Moab features | Slurm partitions       |
|----------------------|------------------------|
| skylake              | skylake or skylake_mpi |
| broadwell            | broadwell              |
| ivybridge            | ivybridge_mpi          |
| pascal               | pascal_gpu             |
| geforce              | geforce_gpu            |
| kepler               | kepler_gpu             |

CPU cores allocation

Requesting CPU cores in Torque/Moab is done with the option -l nodes=X:ppn=Y, where it is mandatory to specify the number of nodes even for single core jobs (-l nodes=1:ppn=1). The concept behind the keyword nodes is different between Torque/Moab and Slurm, though. While Torque/Moab nodes do not necessarily represent a single physical server of the cluster, the option --nodes in Slurm specifies the exact number of physical nodes to be used for the job, as explained in subsection Distributed memory applications.

In Slurm the only mandatory request for CPU resources is the number of tasks of the job, which is set with the option --ntasks=X (1 by default). By default, each task gets allocated one CPU core. If you don’t specify anything else these tasks can be distributed among any number of different nodes in the cluster.

  1. Shared memory applications

    Applications that are only capable of using multiple processors in a single server or physical computer are called shared memory applications (e.g. parallelized using OpenMP, Pthreads, Python multiprocessing, etc.). Jobs for shared memory applications require additional settings to ensure that all allocated CPU cores reside on the same node.

    The simplest option is requesting a single task with --ntasks=1 and then asking to assign X cores on the same physical server to that task with --cpus-per-task=X.

    With the options --ntasks=1 --cpus-per-task=X, the number of allocated cores ($PBS_NP, and in this case also $PBS_NUM_PPN) is given by $SLURM_CPUS_PER_TASK.
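    As a minimal sketch (the program name is a placeholder), the relevant part of a shared memory job script could look as follows, passing the allocated core count on to an OpenMP application:

    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4

    # let the OpenMP application use all cores allocated to the task
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

    my_openmp_code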

  2. Distributed memory applications

    Parallel applications based on a distributed memory paradigm, such as the ones using MPI, can be executed by just specifying the option --ntasks=X, where X is the total number of cores you need. CPU cores will be allocated on one or more nodes, using as few nodes as possible.

    If you want to keep some extra control on how the tasks will be distributed in the cluster, it is possible to specify the number of nodes with the option --nodes=Y. For example, minimizing the number of nodes assigned to the job can lead to better performance if the interconnect between nodes is not very fast. Imagine a cluster composed of nodes with 24 cores where you want to submit a job using 72 cores, but using precisely 3 full nodes. You can do so by asking for --ntasks=72 and adding the extra option --nodes=3.

    If you want to provide some flexibility to the resources allocated to your job, it is also possible to provide a range of values to --nodes, for example with --nodes=3-5. In such a case, cores will be allocated on any number of nodes within the range (although Slurm will try to allocate as many nodes as possible). The job could still end up on just 3 full nodes, but also in other possible combinations, e.g. two full nodes with their 24 cores plus three other nodes with 8 cores each.

    With the option --ntasks=X (and optionally --nodes=Y), the number of cores allocated ($PBS_NP) is given by $SLURM_NTASKS, and the number of cores per node ($PBS_NUM_PPN) is given by $SLURM_TASKS_PER_NODE.
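    As a sketch of the 72-core example above (the program name is a placeholder, and it is assumed that the MPI library integrates with srun; otherwise mpirun can be used), the relevant part of the job script could be:

    #SBATCH --ntasks=72
    #SBATCH --nodes=3

    # launch one MPI rank per allocated task
    srun my_mpi_code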

Interactive jobs

When submitting interactive jobs with more than 1 core in Slurm, the same considerations apply with respect to CPU core allocation.
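For illustration (resource values are placeholders), an interactive shell can be requested with the same CPU options used for batch jobs:

# interactive shell with 4 cores on a single node (shared memory)
srun --ntasks=1 --cpus-per-task=4 --pty bash -l

# interactive shell with 8 tasks, possibly spread over multiple nodes (distributed memory)
srun --ntasks=8 --pty bash -l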

Memory allocation

Jobs that do not define any specific memory request will get a default allocation per core, which is the total node memory divided by the number of cores on the node. In most cases, the default memory allocation is sufficient, and it is also what we recommend. If your jobs need more than the default memory, make sure to regularly check their memory usage to avoid allocating more resources than needed.

We highly recommend specifying the memory allocation of your job with the Slurm option --mem-per-cpu=X, which sets the memory per core. It is also possible to request the total amount of memory per node of your job with the option --mem=X. However, requesting a proper amount of memory with --mem is not trivial for multi-node jobs in which you want to leave some freedom for node allocation. In any case, these two options are mutually exclusive, so you should only use one of them.

The default memory unit is megabytes, but you can specify different units using one of the following one letter suffixes: K, M, G or T. For example, to request 2GB per core you can use --mem-per-cpu=2000 or --mem-per-cpu=2G.
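For example, a job with 4 cores and 2 GB of memory per core (8 GB in total) could be submitted as follows; the script name is a placeholder:

sbatch --ntasks=1 --cpus-per-task=4 --mem-per-cpu=2G job.sh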

Slurm partitions

In Torque/Moab, specific hardware resources can be requested with features. In Slurm, we provide this functionality with partitions. In most cases, specifying a partition is not necessary, as Slurm will automatically determine the partitions that are suitable for your job.

You can also request a comma-separated list of partitions. For example, to indicate that your job may run in partitions skylake or broadwell you can use --partition=skylake,broadwell. Note however, that a job will only run in a single partition. Slurm will decide the partition based on priority and availability.
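The equivalent directive inside a job script would be:

#SBATCH --partition=skylake,broadwell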

Partitions with suffix _mpi are used for jobs that request more than 1 node.

Batch scripts

Torque/Moab job scripts define the resource manager directives at the top of the script using the #PBS keyword. In Slurm the equivalent is the #SBATCH keyword. To illustrate its usage with some of the resource request options discussed above, we provide below a basic job script in both systems requesting a single core, 7 hours of maximum walltime and 3 GB of memory:

Basic single core Torque/Moab batch script
#!/bin/bash
#PBS -N myjob
#PBS -l walltime=07:00:00
#PBS -l nodes=1:ppn=1
#PBS -l pmem=3gb

module load somemodule/1.1.1

cd $PBS_O_WORKDIR

my_code
Basic single core Slurm batch script
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=07:00:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=3G

module load somemodule/1.1.1

cd $SLURM_SUBMIT_DIR

my_code