Job monitoring and checkpoints

We have updated the documentation with new information on job monitoring, restarting jobs from checkpoints and managing data between VSC_DATA and VSC_SCRATCH in your jobs.

Monitoring your jobs is an important part of any HPC workflow. We have expanded the information in the FAQs covering all job stages, from waiting in the queue to the post-mortem analysis of failed jobs.
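
As a quick reference, the commands below illustrate the kind of monitoring those FAQs cover. This is a minimal sketch assuming a Slurm scheduler; `<jobid>` is a placeholder for your actual job ID, and the cluster-specific details are in the FAQs themselves.

```bash
# While the job is waiting or running: list your jobs with their state and reason
squeue -u $USER

# Inspect the full scheduler record of a pending or running job
scontrol show job <jobid>

# After the job has finished or failed: post-mortem accounting information
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,MaxRSS
```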

We have also expanded the section Job Checkpoints and Restarts with new information on how to restart jobs from a checkpoint with the tool DMTCP. This is especially useful for jobs that cannot avoid exceeding the time limit (for instance because they scale poorly) and that use software without built-in support for restarts. If you are in such a situation, you will find an example job script in Checkpoints with DMTCP that can simply be re-submitted to restart your job from its last saved checkpoint. A sketch of that approach follows below.
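
The sketch below shows how such a re-submittable job script could look. It is a minimal, hypothetical example assuming a Slurm cluster and a DMTCP module; `my_simulation`, `input.dat` and the module name are placeholders, and the documented script in Checkpoints with DMTCP remains the reference.

```bash
#!/bin/bash
#SBATCH --job-name=dmtcp-checkpointed
#SBATCH --ntasks=1
#SBATCH --time=24:00:00

# Load DMTCP (module name is an assumption; check "module av dmtcp" on your cluster)
module load DMTCP

if [ -x ./dmtcp_restart_script.sh ]; then
    # A previous run left a checkpoint: resume the application from it
    ./dmtcp_restart_script.sh
else
    # First submission: launch the application under DMTCP control,
    # writing a checkpoint every hour (3600 s) in the submission directory
    dmtcp_launch --interval 3600 ./my_simulation input.dat
fi
```

If the job hits its time limit, re-submitting the same script picks up the latest checkpoint instead of starting over, which is exactly what makes this pattern convenient for long-running, non-restartable software.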

The final piece of new documentation concerns managing data files in your jobs. The new storage for VSC_HOME, VSC_DATA and VSC_DATA_VO is very reliable, but it is significantly slower than the scratch storage of VSC_SCRATCH and VSC_SCRATCH_VO. Therefore, all jobs should read and write their data on the fast scratch to run with optimal performance. We added a new example job script in Data in jobs that achieves this by using a transient working directory in VSC_SCRATCH. This working directory is created on the fly with all data files needed by the job, so all reads and writes during job execution happen on the fast scratch. Only at the very end of the job are any (selected) output files copied back to permanent storage.
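
To illustrate the idea, here is a minimal sketch of such a staging job script, assuming a Slurm cluster and the standard VSC_DATA and VSC_SCRATCH variables; `my_project`, `input.dat`, `my_simulation` and `output.log` are placeholders, and the documented example in Data in jobs remains the reference.

```bash
#!/bin/bash
#SBATCH --job-name=scratch-staging
#SBATCH --time=12:00:00

# Create a transient working directory on the fast scratch storage
WORKDIR="$VSC_SCRATCH/$SLURM_JOB_ID"
mkdir -p "$WORKDIR"
cd "$WORKDIR"

# Stage in the input files from the reliable (but slower) data storage
cp "$VSC_DATA/my_project/input.dat" .

# Run the application: all reads/writes now happen on scratch
my_simulation input.dat > output.log

# Stage out only the selected output files and clean up the transient directory
cp output.log "$VSC_DATA/my_project/"
cd "$VSC_DATA"
rm -rf "$WORKDIR"
```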