5. Job Checkpoints and Restarts

Whether a job that cannot finish within the maximum time limit can be restarted depends on the underlying software or algorithm. Scientific software commonly provides some mechanism to run for a given amount of time, stop in a controlled manner and save intermediate data files to disk. This saved intermediate state is called a checkpoint. The simulation can then be restarted from the last checkpoint, that is, from the last calculated step.

Such methods allow long jobs to be split into shorter, restartable chunks that can be submitted one after the other. Check the documentation of your software for restart options.
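
As a rough sketch of how such chunks can be queued one after the other, the following Bash function submits the same job script several times and uses Slurm's `--dependency` option to make each submission wait for the previous one. The function name `submit_chain`, the script name `job.sh` and the number of chunks are illustrative assumptions, and the job script itself is assumed to resume from its own latest checkpoint.

```shell
#!/bin/bash
# Hypothetical helper: submit_chain <job_script> <number_of_chunks>
# Submits the job script N times; each job starts only after the previous one has ended.
submit_chain() {
    local script="$1" chunks="$2" jobid="" i
    for ((i = 1; i <= chunks; i++)); do
        if [ -z "$jobid" ]; then
            # The first chunk has no dependency
            jobid=$(sbatch --parsable "$script") || return 1
        else
            # afterany: start once the previous job has ended, whatever its final state
            # (a job stopped by the time limit does not end successfully)
            jobid=$(sbatch --parsable --dependency="afterany:${jobid}" "$script") || return 1
        fi
        # --parsable may print "jobid;cluster"; keep only the job ID
        jobid="${jobid%%;*}"
        echo "chunk ${i}: job ${jobid}"
    done
}

# Example usage (assumes job.sh resumes from its latest checkpoint):
# submit_chain job.sh 3
```

Once the calculation completes, any chunks still waiting in the queue are no longer needed and can be cancelled.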

5.1. Alternatives to checkpoints

If your software does not support any checkpoint/restart method, there are several options to reduce the time of your jobs:

Increase the parallelization of your job: Check if you get any speed-up by requesting more cores. Scientific software commonly supports running on multiple cores in the same node (i.e. --cpus-per-task > 1). Check also if the software supports multi-node jobs (usually with MPI and --ntasks > 1).
Increase the efficiency of your job: Check for I/O bottlenecks during execution. Is your job intensively swapping data between disk and RAM? In that case, increasing the memory of the job might improve performance, as disk access is much slower than RAM access.
Divide your job into parts: Even if the software does not provide any method to restart the calculation, in many cases it is possible to manually divide a job into smaller parts that can be recombined afterwards. If the parts depend on each other, save to disk any data needed by subsequent parts and submit each part sequentially in its own job.
Use faster CPUs: Hydra is a heterogeneous cluster with multiple CPU generations. You can specify that your job must run on the fastest (newest) nodes available by submitting it to the corresponding partition. For example, request nodes with AMD Zen 5 CPUs with -p zen5_mpi.
Use GPUs: Check if the software supports running your job on a GPU (or even 2 GPUs), and verify that your job actually runs faster on GPU(s).
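
To judge whether requesting more cores pays off, it helps to compare the wall time of a short test run on one core with the same run on several cores. The sketch below computes the speed-up and the parallel efficiency from two such timings. The function name `parallel_efficiency` and the example timings are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: parallel_efficiency <seconds_on_1_core> <seconds_on_N_cores> <N>
# Prints the speed-up (T1/TN) and the parallel efficiency (speed-up/N) as a percentage.
parallel_efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        speedup = t1 / tn
        printf "speedup=%.2f efficiency=%.0f%%\n", speedup, 100 * speedup / n
    }'
}

# Made-up timings: 3600 s on 1 core, 1200 s on 4 cores
parallel_efficiency 3600 1200 4   # prints: speedup=3.00 efficiency=75%
```

An efficiency well below 100% means that the extra cores are mostly idle, in which case one of the other options in this list is likely a better fit.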

5.2. Checkpoints with DMTCP

Some jobs scale poorly, whether due to the characteristics of the simulation, limitations of the underlying software or algorithms, or being I/O bound. This renders parallelization ineffective, as adding extra CPU cores has a negligible effect or is not possible at all. In such cases, if the software running the simulation does not support any method to stop and restart itself, hitting the maximum time limit can become unavoidable.

DMTCP is a tool that can be useful in such cases. It creates checkpoints of running applications and saves them on disk, so that execution can be restarted at a later time (in another job). DMTCP works well with serial or multi-threaded applications, which are the kind of applications that typically suffer from poor scaling and commonly hit the job time limit. However, DMTCP does not support MPI or GPU applications. In those cases, increasing the parallelization of the simulation is the recommended approach to reduce the time of the job.

Since DMTCP is external to the software used in your job, it has to be integrated in the execution of your job:

Start your simulation with the command dmtcp_launch, enabling DMTCP to map the memory structures of your job and create checkpoints at certain time intervals:
```
dmtcp_launch -i <checkpoint_interval_seconds> <your_command>
```
In case of interruption of the job, restart execution from the last checkpoint and create new checkpoints with the command dmtcp_restart:
```
dmtcp_restart -i <checkpoint_interval_seconds> /path/to/checkpoint.dmtcp
```

Warning

Checkpoint (.dmtcp) files need special attention. They can be quite large, as their size depends on the amount of memory used by your job. Therefore, always save checkpoints on scratch storage, check in advance that there is enough space to store them and remember to remove any checkpoint files that are no longer needed.

By default, DMTCP creates the checkpoint files in the directory from which it is executed, but the location of the checkpoints can be changed with the option --ckptdir. DMTCP only generates one checkpoint file per job. Old checkpoints for the same job will be overwritten.
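
Since checkpoint files grow with the memory footprint of the job, a quick check of the free space in the checkpoint directory before launching can avoid a failed checkpoint. The following is a minimal sketch: the function name `has_free_space`, the 16 GiB figure and the fallback to /tmp are assumptions for illustration. Note that `df` reports the free space of the file system as a whole and does not account for personal quotas.

```shell
#!/bin/bash
# Hypothetical helper: has_free_space <directory> <required_GiB>
# Succeeds if the file system holding <directory> has at least <required_GiB> available.
has_free_space() {
    local dir="$1" need_gib="$2" avail_kib
    # df -P prints one POSIX-format line per file system; column 4 is the available space in KiB
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge $((need_gib * 1024 * 1024)) ]
}

# Example: the job is assumed to use about 16 GiB of memory, so require at least that much
CKPT_DIR="${VSC_SCRATCH:-/tmp}"
if has_free_space "$CKPT_DIR" 16; then
    echo "enough space for checkpoints in $CKPT_DIR"
else
    echo "WARNING: less than 16 GiB free in $CKPT_DIR" >&2
fi
```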

The following job script is a simple example that handles checkpoints with DMTCP automatically. The job can be submitted to the queue as usual. If it does not finish on time for any reason, it can be resubmitted and will continue from the last checkpoint.

Job script with automatic checkpoints with DMTCP

#!/bin/bash

# note: make sure that this job's name is unique
#SBATCH --job-name="unique-job-name"
#SBATCH --output="%x-%j.out"
#SBATCH --time=120:00:00
#SBATCH --partition=zen4

module load PoorScalingSoftware/1.1.1

# Time interval between checkpoints in seconds
export DMTCP_CHECKPOINT_INTERVAL=82800
# Directory for checkpoint files
CKPT_DIR="${VSC_SCRATCH:+${VSC_SCRATCH}/}checkpoints/${SLURM_JOB_NAME:-jobless}"
mkdir -p "$CKPT_DIR"

# Find the most recent checkpoint file, if any (empty when none exists)
CKPT_FILE=$(ls -t1 "$CKPT_DIR"/*.dmtcp 2>/dev/null | head -n 1)
if [ -n "$CKPT_FILE" ]; then
    # Restart from most recent checkpoint
    echo "== Restarting job from checkpoint file: $CKPT_FILE"
    dmtcp_restart "$CKPT_FILE"
else
    # Start simulation
    echo "== Job checkpoints will be created in: $CKPT_DIR"
    dmtcp_launch --ckptdir "$CKPT_DIR" <poor_scaling_program>
fi

# Clean checkpoints (if we reach this point, job is complete)
rm -r "$CKPT_DIR"

This job script handles checkpoint files automatically. They will be stored in a specific sub-folder in $VSC_SCRATCH/checkpoints and removed once the job is complete. Make sure that each of your jobs being checkpointed has a unique name. The name of the job is used to save the checkpoints in a folder that is specific to it, but that can be reused between runs.

In this example, the time interval between checkpoints is set through the environment variable DMTCP_CHECKPOINT_INTERVAL. Do not set this interval too short: creating a checkpoint has a performance penalty, so the more frequent the checkpoints, the more your job is slowed down. We recommend setting this interval to 23 hours, with a minimum of 12 hours.
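
Because the interval is given in seconds, converting from hours in the job script itself makes the setting easier to read and to verify. The sketch below does that and refuses values under the 12-hour minimum. The function name `checkpoint_interval` is a hypothetical helper, not part of DMTCP.

```shell
#!/bin/bash
# Hypothetical helper: checkpoint_interval <hours>
# Prints the interval in seconds; rejects values below the recommended 12-hour minimum.
checkpoint_interval() {
    local hours="$1"
    if [ "$hours" -lt 12 ]; then
        echo "interval of ${hours} h is below the recommended minimum of 12 h" >&2
        return 1
    fi
    echo $((hours * 3600))
}

# Recommended setting: 23 hours
DMTCP_CHECKPOINT_INTERVAL=$(checkpoint_interval 23)
export DMTCP_CHECKPOINT_INTERVAL
echo "$DMTCP_CHECKPOINT_INTERVAL"   # prints 82800
```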

Tip

To execute Java applications with DMTCP, change the signal number used for checkpointing by adding the following command at the beginning of your job script: export DMTCP_SIGCKPT=10

Helpdesk: We can help you with issues related to checkpoints and restarts.