4. Job Checkpoints and Restarts

Whether a job that cannot finish before the maximum time limit can be restarted depends on the underlying software or algorithm. Scientific software commonly provides some mechanism to run for a certain amount of time, stop in a controlled manner and save intermediate data files to disk. This saved intermediate state is called a checkpoint. The simulation can then be restarted from the last calculated step, that is, from the last checkpoint.

Such methods allow you to split long jobs into shorter, restartable chunks that can be submitted one after the other. Check the documentation of your software for restart options.
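As a rough illustration of the idea, not tied to any particular scientific package, the following sketch records the last completed step in a state file and resumes from it when run again. The function name `run_steps`, the state file and the step budget (which stands in for the time limit) are all hypothetical.

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch. All names are illustrative only.

# run_steps STATE_FILE TOTAL_STEPS [MAX_STEPS_THIS_RUN]
# Resumes after the last step recorded in STATE_FILE and stops early once
# MAX_STEPS_THIS_RUN steps are done (a stand-in for hitting the time limit).
run_steps() {
    local state_file="$1" total="$2" budget="${3:-$2}"
    local last=0 done_now=0 step

    # Resume from the checkpoint if one exists
    if [ -f "$state_file" ]; then
        last=$(cat "$state_file")
    fi

    step=$(( last + 1 ))
    while [ "$step" -le "$total" ]; do
        if [ "$done_now" -ge "$budget" ]; then
            echo "stopped after step $(( step - 1 )), resubmit to continue"
            return 0
        fi
        # ... the real work for this step would go here ...

        # Checkpoint: write to a temporary file, then rename it, so an
        # interruption never leaves a half-written state file behind
        echo "$step" > "$state_file.tmp" && mv "$state_file.tmp" "$state_file"
        done_now=$(( done_now + 1 ))
        step=$(( step + 1 ))
    done
    echo "finished all $total steps"
}
```

Calling `run_steps state.txt 5 2` followed by `run_steps state.txt 5` completes the five steps over two runs, with the second run picking up at step 3.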

4.1. Alternatives to checkpoints

If your software does not support any checkpoint/restart method, there are several options to reduce the time of your jobs:

Increase the parallelization of your job

Check if you get any speed-up by requesting more cores. Scientific software commonly supports running on multiple cores in the same node (i.e. --cpus-per-task > 1). Check also if the software supports multi-node jobs (usually with MPI and --ntasks > 1).

Increase the efficiency of your job

Check for I/O bottlenecks during execution. Is your job intensively swapping data between disk and RAM? In that case, increasing the memory of the job might improve performance, as disk access is much slower than RAM access.

Divide your job into parts

Even if the software does not provide any method to restart the calculation, in many cases it is possible to manually divide a job into smaller parts that can be recombined afterwards. If the parts depend on each other, save to disk any data needed by subsequent parts and submit each part sequentially in its own job.

Use faster CPUs

Hydra is a heterogeneous cluster with multiple CPU generations. You can require that your job runs on the fastest (newest) nodes available by submitting it to the corresponding partition. For instance, request nodes with Intel Skylake CPUs with -p skylake.

Use GPUs

Check whether the software supports running your job on a GPU (or even 2 GPUs), and whether your job actually gets a speed-up from running on GPU(s).
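For a job divided into parts that must run in order, Slurm job dependencies can queue all parts at once. The sketch below assumes hypothetical job scripts named `part1.sh`, `part2.sh` and so on, and relies on `sbatch --parsable` together with `--dependency=afterok`. The `SBATCH` variable is not a Slurm feature; it only exists so that the function can be tried with a stand-in command.

```shell
#!/bin/bash
# Command used to submit jobs (override it to dry-run the function)
SBATCH=${SBATCH:-sbatch}

# submit_chain SCRIPT...
# Submits each job script so that it starts only after the previous one
# has completed successfully. Prints the job IDs in submission order.
submit_chain() {
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$("$SBATCH" --parsable "$script") || return 1
        else
            jobid=$("$SBATCH" --parsable --dependency="afterok:$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster": keep only the ID
        prev=${jobid%%;*}
        echo "$prev"
    done
}
```

A typical call would be `submit_chain part1.sh part2.sh part3.sh`. With `afterok`, a later part is not started if the previous one fails.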

4.2. Checkpoints with DMTCP

Some jobs might scale poorly, either because of the characteristics of the simulation, because of limitations of the underlying software/algorithms or because they are I/O bound. This renders parallelization ineffective, as adding extra CPU cores will have a negligible effect or not be possible at all. In such cases, if the underlying software running the simulations does not support any method to stop and restart itself, hitting the maximum time limit can become unavoidable.
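One way to judge whether a job scales poorly is to compare its wall time on one core with its wall time on several cores. The helper below is only a sketch: the name `scaling_report` is hypothetical and the timings must come from your own test runs.

```shell
#!/bin/bash
# scaling_report T1 TN N
# T1: wall time in seconds on 1 core, TN: wall time in seconds on N cores.
# Prints the speed-up (T1/TN) and the parallel efficiency (speed-up/N).
scaling_report() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        speedup = t1 / tn
        printf "speedup=%.2f efficiency=%.0f%%\n", speedup, 100 * speedup / n
    }'
}
```

For example, `scaling_report 3600 3000 4` describes a job that takes 3600 s on one core and 3000 s on four cores. Its efficiency is far below 100%, so the extra cores are mostly wasted.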

DMTCP is a tool that can be useful for jobs with poor scaling: it can create checkpoints of other programs and restart their execution. Therefore, it can potentially be applied to any other software module in the cluster. It works well on single-threaded and multi-threaded non-MPI applications. It can also work with some MPI implementations, but results with MPI applications may vary. One known limitation is that it does not work with GPUs. Nonetheless, its main domain of application is jobs with poor scaling, and those commonly use single-threaded or multi-threaded software, precisely where DMTCP works best.

Since DMTCP is external to the software used in your job, it has to be integrated in the execution of your job:

  1. Start your simulation with the command dmtcp_launch, enabling DMTCP to map the memory structures of your job and create checkpoints at certain time intervals:

    dmtcp_launch -i <checkpoint_interval_seconds> <your_command>
    
  2. In case of interruption of the job, restart execution from the last checkpoint and create new checkpoints with the command dmtcp_restart:

    dmtcp_restart -i <checkpoint_interval_seconds> /path/to/checkpoint.dmtcp
    

Warning

Special attention is needed in handling the checkpoint (.dmtcp) files. They can be quite large, as their size depends on the amount of memory used by your job. Therefore, always use scratch storage to save checkpoints, check in advance that there is enough space to store them and remember to remove any checkpoint files that are no longer needed.

By default, DMTCP creates the checkpoint files in the directory from which it is executed, but the location of the checkpoints can be changed with the option --ckptdir. DMTCP only generates one checkpoint file per job. Old checkpoints for the same job will be overwritten.
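The space check mentioned above can be scripted before launching the job. The following is a minimal sketch with a hypothetical helper `has_space`. It assumes that the memory requested by the job is a reasonable upper estimate of the checkpoint size, and note that `df` reports the free space of the file system, which is not necessarily the same as your personal quota.

```shell
#!/bin/bash
# has_space DIR REQUIRED_KB
# Succeeds if the file system holding DIR has at least REQUIRED_KB
# kilobytes available. Uses POSIX "df -P -k", where the 4th column of
# the 2nd output line is the available space in 1024-byte blocks.
has_space() {
    local dir="$1" required_kb="$2" avail_kb
    avail_kb=$(df -P -k "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$required_kb" ]
}
```

In a job script it could be used as `has_space "$CKPT_DIR" 16000000 || exit 1` for a job that uses up to roughly 16 GB of memory.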

The following job script is a simple example that handles checkpoints with DMTCP automatically. The job can be submitted to the queue as usual and, if it does not finish on time for any reason, it can be re-submitted and will continue from the last checkpoint.

Job script with automatic checkpoints with DMTCP

    #!/bin/bash

    # note: make sure that this job's name is unique
    #SBATCH --job-name="unique-job-name"
    #SBATCH --output="%x-%j.out"
    #SBATCH --time=120:00:00
    #SBATCH --partition=skylake

    module load PoorScalingSoftware/1.1.1

    # Time interval between checkpoints in seconds
    export DMTCP_CHECKPOINT_INTERVAL=82800
    # Directory for checkpoint files
    CKPT_DIR="${VSC_SCRATCH:+${VSC_SCRATCH}/}checkpoints/${SLURM_JOB_NAME:-jobless}"
    mkdir -p "$CKPT_DIR"

    # Most recent checkpoint file (empty if there is none yet)
    CKPT_FILE=$(ls -t1 "$CKPT_DIR"/*.dmtcp 2>/dev/null | head -n 1)

    if [ -n "$CKPT_FILE" ]; then
        # Restart from most recent checkpoint
        echo "== Restarting job from checkpoint file: $CKPT_FILE"
        dmtcp_restart "$CKPT_FILE"
    else
        # Start simulation
        echo "== Job checkpoints will be created in: $CKPT_DIR"
        dmtcp_launch --ckptdir "$CKPT_DIR" <poor_scaling_program>
    fi

    # Clean checkpoints (if we reach this point, job is complete)
    rm -r "$CKPT_DIR"

This job script handles checkpoint files automatically. They will be stored in a specific sub-folder in $VSC_SCRATCH/checkpoints and removed once the job is complete. Make sure that each of your jobs being checkpointed has a unique name. The job name determines the folder where the checkpoints are saved, so that folder is specific to the job and is reused between its runs.
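The checkpoint folder in the job script above is built with two shell parameter expansions. The sketch below wraps the same expression in a hypothetical function `ckpt_dir` to show what it expands to. The path `/scratch/user` used afterwards is a made-up example value.

```shell
#!/bin/bash
# ckpt_dir: print the checkpoint directory as computed in the job script.
# ${VAR:+text}    expands to text only if VAR is set and not empty
# ${VAR:-default} expands to default if VAR is unset or empty
ckpt_dir() {
    echo "${VSC_SCRATCH:+${VSC_SCRATCH}/}checkpoints/${SLURM_JOB_NAME:-jobless}"
}
```

With `VSC_SCRATCH=/scratch/user` and `SLURM_JOB_NAME=myjob` the function prints `/scratch/user/checkpoints/myjob`. If both variables are unset, for instance when testing outside of a job, it falls back to the relative path `checkpoints/jobless`.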

The time interval between checkpoints is set in this example through the environment variable DMTCP_CHECKPOINT_INTERVAL. It is very important not to set this interval too short, as creating checkpoints carries a performance penalty: the more frequent they are, the more your job will be slowed down. We recommend setting this interval to 23 hours, with a minimum of 12 hours.
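Since the interval is given in seconds, a small conversion helper can prevent mistakes. The function `interval_seconds` below is a hypothetical sketch. It converts whole hours to seconds and refuses values below the 12-hour minimum recommended above.

```shell
#!/bin/bash
# interval_seconds HOURS
# Converts a checkpoint interval in whole hours to seconds. Fails for
# anything that is not a plain whole number or is below 12 hours.
interval_seconds() {
    local hours="$1"
    case "$hours" in
        ''|*[!0-9]*|0*)
            echo "interval must be a whole number of hours without leading zeros" >&2
            return 1
            ;;
    esac
    if [ "$hours" -lt 12 ]; then
        echo "interval must be at least 12 hours" >&2
        return 1
    fi
    echo $(( hours * 3600 ))
}
```

In a job script this could be used as `export DMTCP_CHECKPOINT_INTERVAL=$(interval_seconds 23)`.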

Tip

To execute Java applications with DMTCP, change the signal number used for checkpointing by adding the following command at the beginning of your job: export DMTCP_SIGCKPT=10

Helpdesk

We can help you with issues related to checkpoints/restarts.