Job monitoring and checkpoints

We have updated the documentation with new information on job monitoring, restarting jobs from checkpoints and managing data between VSC_DATA and VSC_SCRATCH in your jobs.

Monitoring your jobs is an important part of any HPC workflow. We have expanded the information in the FAQs covering all job stages, from waiting in the queue to the post-mortem analysis of failed jobs.
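
As a quick reference, the commands below illustrate the kind of monitoring those FAQs cover. This is a minimal sketch assuming a Slurm scheduler; `<jobid>` is a placeholder for your actual job ID, and the cluster-specific details are in the FAQs themselves.

```bash
# While the job is waiting or running: list your jobs with their state and reason
squeue -u $USER

# Inspect the full scheduler record of a pending or running job
scontrol show job <jobid>

# After the job has finished or failed: post-mortem accounting information
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,MaxRSS
```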

We have also expanded the section Job Checkpoints and Restarts with new information on how to restart jobs from a checkpoint with the tool DMTCP. This is especially useful for jobs that cannot avoid exceeding the time limit (for instance because they scale poorly) and that use software without built-in support for restarts. If you are in such a situation, you will find an example job script in Checkpoints with DMTCP that can simply be re-submitted to restart your job from its last saved checkpoint. A sketch of that approach follows below.
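
The sketch below shows how such a re-submittable job script could look. It is a minimal, hypothetical example assuming a Slurm cluster and a DMTCP module; `my_simulation`, `input.dat` and the module name are placeholders, and the documented script in Checkpoints with DMTCP remains the reference.

```bash
#!/bin/bash
#SBATCH --job-name=dmtcp-checkpointed
#SBATCH --ntasks=1
#SBATCH --time=24:00:00

# Load DMTCP (module name is an assumption; check "module av dmtcp" on your cluster)
module load DMTCP

if [ -x ./dmtcp_restart_script.sh ]; then
    # A previous run left a checkpoint: resume the application from it
    ./dmtcp_restart_script.sh
else
    # First submission: launch the application under DMTCP control,
    # writing a checkpoint every hour (3600 s) in the submission directory
    dmtcp_launch --interval 3600 ./my_simulation input.dat
fi
```

If the job hits its time limit, re-submitting the same script picks up the latest checkpoint instead of starting over, which is exactly what makes this pattern convenient for long-running, non-restartable software.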

The final piece of new documentation concerns managing data files in your jobs. The new storage for VSC_HOME, VSC_DATA and VSC_DATA_VO is very reliable, but it is significantly slower than the scratch storage of VSC_SCRATCH and VSC_SCRATCH_VO. Therefore, all jobs should read and write their data on the fast scratch to run with optimal performance. We added a new example job script in Data in jobs that achieves this by using a transient working directory in VSC_SCRATCH. This working directory is created on the fly with all data files needed by the job, so all reads and writes during job execution happen on the fast scratch. Only at the very end of the job are any (selected) output files copied back to permanent storage.
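
To illustrate the idea, here is a minimal sketch of such a staging job script, assuming a Slurm cluster and the standard VSC_DATA and VSC_SCRATCH variables; `my_project`, `input.dat`, `my_simulation` and `output.log` are placeholders, and the documented example in Data in jobs remains the reference.

```bash
#!/bin/bash
#SBATCH --job-name=scratch-staging
#SBATCH --time=12:00:00

# Create a transient working directory on the fast scratch storage
WORKDIR="$VSC_SCRATCH/$SLURM_JOB_ID"
mkdir -p "$WORKDIR"
cd "$WORKDIR"

# Stage in the input files from the reliable (but slower) data storage
cp "$VSC_DATA/my_project/input.dat" .

# Run the application: all reads/writes now happen on scratch
my_simulation input.dat > output.log

# Stage out only the selected output files and clean up the transient directory
cp output.log "$VSC_DATA/my_project/"
cd "$VSC_DATA"
rm -rf "$WORKDIR"
```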