Job Submission

Running calculations on Hydra requires submitting a job to the job queue. Hydra uses Moab as the job scheduler and TORQUE as the resource manager.

VSC Docs: Running Jobs has an extensive introduction to job scripting, submission and monitoring, as well as links to more advanced topics. The sections below assume that you are already familiar with the basics of running jobs on the HPC cluster.

Hydra Submit Filter

In Hydra, before a job goes into the queue, a submit filter checks the job script for missing options and errors. The following options are added if not specified by the user:

  • send email when job aborts

  • request 4 GB of memory per core

  • assign jobs requesting 1 node and more than 245 GB to the high-memory node

  • assign jobs requesting 1 or more GPUs to a GPU node

The job submission will be aborted if the requested resources meet any of the following conditions:

  • requested RAM memory per core is less than 1 GB

  • requested RAM memory per node is more than the total RAM memory of the node

  • requested number of cores per node is higher than the total number of cores of the node

  • requested number of cores is less than the requested number of GPUs (must request at least 1 core per GPU)

  • requested job queue does not match the requested resources

  • features do not exist, do not match the requested resources, or are mutually incompatible
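
For illustration, the following sketch of a job script header requests everything explicitly, so none of the defaults above are applied and none of the abort conditions are triggered. The resource values and email address are placeholders to adapt to your own job:

    #!/bin/bash
    #PBS -l nodes=1:ppn=4          # stay within the core count of a single node
    #PBS -l pmem=4gb               # at least 1 GB per core, within the node's total RAM
    #PBS -l walltime=12:00:00
    #PBS -m abe                    # email on abort, begin and end
    #PBS -M email@example.com

    cd $PBS_O_WORKDIR
    my_code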

Job Restart and Checkpointing

The ability to restart a job that failed to finish before the walltime depends on the capability of the software to do so. It is common for scientific software to provide some mechanism to run for a given amount of time, stop in a controlled manner while saving intermediate data files to disk, and restart the simulation from the last calculated step. Such methods allow long jobs to be split into shorter, restartable chunks, which can be submitted one after the other. Check the documentation of your software for restart options.
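
For example, if your code writes restart files, one common pattern is to submit the chunks as a chain of dependent jobs, so that each chunk only starts after the previous one has finished successfully. A sketch, assuming a hypothetical script chunk.pbs that runs your code for a limited time and picks up the restart file written by the previous run:

    # Submit the first chunk, then two more chunks that each depend
    # on the successful completion of the previous one.
    jobid=$(qsub chunk.pbs)
    jobid=$(qsub -W depend=afterok:$jobid chunk.pbs)
    jobid=$(qsub -W depend=afterok:$jobid chunk.pbs)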

If your software does not support any restart method, you can use external checkpointing. Checkpointing means making a snapshot of the current memory structures of your calculation and saving those to disk. The snapshot can then be used to restore the exact state of your simulation and continue from there. Checkpointing in Hydra can be done conveniently with csub, a tool that automates the process of:

  • halting the job

  • checkpointing the job

  • killing the job

  • re-submitting the job to the queue

  • reloading the checkpointed state into memory

Example: the job script ‘myjob.pbs’ is submitted with csub, checkpointing and re-submitting every 24 hours. The checkpoint/re-submit cycle is repeated until your calculation has completed:

csub -s myjob.pbs --shared --job_time=24:00:00

Technical aspects of csub

  • Checkpointing and reloading are done as part of the job and typically take up to 15 minutes, depending on the amount of RAM memory in use. Take this extra time into account when setting the walltime of your job script:

    Job walltime for a csub job with --job_time=24:00:00
     #PBS -l walltime=24:15:00
    
  • Checkpoint files are written in the directory $VSC_SCRATCH/chkpt along with job output and error files, csub log files and any output files generated by your simulation.

  • Internally, csub uses DMTCP (Distributed MultiThreaded CheckPointing). Users who want full control can also use DMTCP directly: check our example launch.pbs and restart.pbs job scripts; a minimal sketch of this approach is shown after this list.

  • csub/DMTCP has not yet been tested with all software installed in Hydra. It has been used successfully with software written in Python and R, and with Gaussian.

    See also

    DMTCP Supported Apps for official guidelines on supported software.

  • If you are running a Gaussian 16 job with csub, a few extra lines must be added to your job script:

     module load Gaussian/G16.A.03-intel-2017b
     unset LD_PRELOAD
     module unload jemalloc/4.5.0-intel-2017b
     export G09NOTUNING=1
     export GAUSS_SCRDIR=$VSC_SCRATCH/<my_unique_gauss_scrdir>  # make sure this directory is present
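
For users who want to drive DMTCP directly, the following minimal sketches show what a launch-style and a restart-style job script could look like. These are not the site-provided launch.pbs and restart.pbs; the checkpoint interval and directory are assumptions, and the option names follow the standard DMTCP command-line tools (check dmtcp_launch --help for the flags available in the installed version):

    Sketch of a launch-style job script
     #!/bin/bash
     #PBS -l walltime=24:15:00
     #PBS -l nodes=1:ppn=1
     # Run my_code under DMTCP control, writing a checkpoint
     # to $VSC_SCRATCH/chkpt every hour (3600 seconds).
     cd $PBS_O_WORKDIR
     dmtcp_launch --ckptdir $VSC_SCRATCH/chkpt --interval 3600 ./my_code

    Sketch of a restart-style job script
     #!/bin/bash
     #PBS -l walltime=24:15:00
     #PBS -l nodes=1:ppn=1
     # Resume from the checkpoint images written by the previous run.
     cd $PBS_O_WORKDIR
     dmtcp_restart $VSC_SCRATCH/chkpt/ckpt_*.dmtcp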

Helpdesk: We can help you with issues related to checkpointing/restarting.

PBS to Slurm Cheatsheet

This section is meant for users experienced with a PBS-based resource manager (such as TORQUE) who want to quickly get up and running with Slurm.

The tables below provide a quick reference translating typical PBS commands and resource request options into their Slurm equivalents, which can help you with the migration. Below the tables there are some extra remarks to consider when adapting your jobs.

Submitting and monitoring jobs

PBS                Slurm                        Comments
---------------    -------------------------    -----------------------------------------
qsub job.sh        sbatch job.sh                Submit a job with the batch script job.sh
qsub -I            salloc <resource options>    Start an interactive job
qdel job_id        scancel job_id               Delete your job
qstat              squeue                       Show the status of the job queue
qstat -f job_id    scontrol show job job_id     Show details about your scheduled job
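
For example, a typical submit-and-monitor cycle with the Slurm commands above could look as follows (the job ID 123456 is just a placeholder for the value printed by sbatch):

    sbatch job.sh              # prints e.g. "Submitted batch job 123456"
    squeue -u $USER            # status of your jobs in the queue
    scontrol show job 123456   # details about the scheduled job
    scancel 123456             # delete the job if it is no longer needed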

Requesting resources

PBS                     Slurm                                       Comments
--------------------    ----------------------------------------    --------------------------------------------
-N job_name             --job-name=job_name                         Set the name of your job
-l walltime=HH:MM:SS    --time=DD-HH:MM:SS                          Requested maximum time for your job to run
-l nodes=1:ppn=1        --ntasks=1                                  Request a single CPU core
-l nodes=X:ppn=Y        --ntasks=X --cpus-per-task=Y                See the explanation in the section below
-l pmem=N               --mem-per-cpu=N                             Amount of memory per CPU core in megabytes
-M email@example.com    --mail-user=email@example.com               Email address for job alerts
-m <a|b|e>              --mail-type=<BEGIN|END|FAIL|REQUEUE|ALL>    Condition for email alerts; in Slurm choose one value or a comma-separated list
-o out_file             --output out_file                           File to write stdout; in Slurm, stdout and stderr are combined if no --error is given
-e err_file             --error err_file                            File to write stderr; in Slurm, stdout and stderr are combined if no --output is given
-j oe                   -                                           In Slurm, joining stdout and stderr is achieved by providing just one of the two options above

Variables defined by the resource managers

PBS               Slurm                   Comments
--------------    --------------------    ------------------------------------------
$PBS_JOBID        $SLURM_JOB_ID           The job ID value
$PBS_O_WORKDIR    $SLURM_SUBMIT_DIR       Directory where the job was submitted from
$PBS_NODEFILE     $SLURM_JOB_NODELIST     List of nodes assigned to the job
$PBS_JOBNAME      $SLURM_JOB_NAME         The job name
$PBS_ARRAYID      $SLURM_ARRAY_TASK_ID    Job array ID (index) number
$PBS_NUM_PPN      $SLURM_CPUS_PER_TASK    Number of cores per task
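
As a small illustration of the Slurm variables above, the following fragment of a batch script runs from the submission directory and prints some job information (note that $SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested):

    cd $SLURM_SUBMIT_DIR
    echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) runs on: $SLURM_JOB_NODELIST"
    echo "Cores per task: $SLURM_CPUS_PER_TASK"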

CPU cores allocation

Requesting CPU cores in a PBS scheduler is done with the option -l nodes=X:ppn=Y, where it is mandatory to specify the number of nodes even for single core jobs (-l nodes=1:ppn=1). The concept behind the keyword nodes is different between PBS and Slurm, though. While PBS nodes do not necessarily represent a single physical server of the cluster, the option --nodes in Slurm is directly linked to the physical servers of the cluster, as will be explained below.

In Slurm the only mandatory request for CPU resources is the number of tasks of the job, which is set with the option --ntasks=X (1 by default). Each task is allocated one CPU core by default. If you do not specify anything else, these tasks can be distributed among any number of different nodes in the cluster.

Applications that can only use multiple processors within a single server or physical computer, usually called shared memory applications (e.g. parallelized with OpenMP, Pthreads, Python multiprocessing, etc.), require additional settings to ensure that all allocated processors reside on the same node. A practical option is to request a single task with --ntasks=1 and then ask for X cores on the same physical server for that task with --cpus-per-task=X.
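
For instance, a shared memory job using 8 cores on a single node could be sketched as follows (OMP_NUM_THREADS is the thread-count variable used by OpenMP applications; adapt it to whatever your software expects):

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8

    # All 8 cores are guaranteed to be on the same node.
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    my_code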

Parallel applications based on a distributed memory paradigm, such as those using MPI, can be executed by just specifying the option --ntasks=X, where X is the total number of cores you need. The CPU cores will then be allocated in any fashion among the nodes of the cluster.
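
A sketch of a 64-core MPI job is shown below; srun starts one MPI rank per allocated task (some MPI installations may require mpirun instead, so check the documentation of the module you load). The module name and the executable my_mpi_code are placeholders:

    #!/bin/bash
    #SBATCH --ntasks=64

    # The 64 tasks may be spread over several nodes.
    module load somemodule/1.1.1
    srun ./my_mpi_code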

If you want to keep some extra control over how the tasks are distributed in the cluster, it is possible to limit the number of nodes with the option --nodes. For instance, minimizing the number of nodes assigned to the job can lead to better performance if the interconnect between nodes is not very fast. Imagine a cluster composed of nodes with 24 cores each, where you want to submit a job using 72 cores on precisely 3 full nodes. You can do so by asking for --ntasks=72 and adding the extra option --nodes=3.

If you want to allow some flexibility in the resources allocated to your job, it is also possible to give --nodes a range of values, for instance --nodes=3-5. In that case, the cores will be allocated on any number of nodes within the range. The job could still end up on just 3 full nodes, but also in other possible combinations, e.g. two full nodes with their 24 cores each plus three other nodes with 8 cores each.
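
Using the 24-core nodes of the example above, the two variants could be requested with the following directives (use only one of the two in a given job script):

    # Exactly 3 full nodes:
    #SBATCH --ntasks=72
    #SBATCH --nodes=3

    # Or, allowing the scheduler to use between 3 and 5 nodes:
    #SBATCH --ntasks=72
    #SBATCH --nodes=3-5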

Memory allocation

We highly recommend specifying the memory allocation of your job with the Slurm option --mem-per-cpu=X, which sets the memory per core. It is also possible to request the total amount of memory per node with the option --mem=X. However, requesting a proper amount of memory with --mem is not trivial for multi-node jobs in which you want to leave some freedom in node allocation. In any case, these two options are mutually exclusive, so you should only use one of them. If you do not define any specific memory request, your job will get a default assignment, typically 1 GB per core.

Notice that, by default, a plain integer value given to these options is interpreted as megabytes, but you can specify different units using one of the following one-letter suffixes: K, M, G or T. For instance, to request 2 gigabytes per core you can use --mem-per-cpu=2000 or --mem-per-cpu=2G.

Batch scripts

PBS job scripts define the resource manager directives in their preamble using the #PBS keyword. In Slurm the equivalent is the #SBATCH keyword. To illustrate its usage with some of the resource request options discussed above, we provide below a basic job script for both systems, requesting a single core, 7 hours of maximum walltime and 3 gigabytes of memory:

Basic single core PBS batch script

#!/bin/sh
#PBS -N myjob
#PBS -l walltime=07:00:00
#PBS -l nodes=1:ppn=1
#PBS -l pmem=3gb

module load somemodule/1.1.1

my_code

Basic single core Slurm batch script

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=07:00:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=3000

module load somemodule/1.1.1

my_code