# Job monitoring and checkpoints
We have updated the documentation with new information on job monitoring, restarting jobs from checkpoints, and managing data between VSC_DATA and VSC_SCRATCH in your jobs.
Monitoring your jobs is an important part of any HPC workflow. We have expanded the information in the following FAQs, which cover all job stages from waiting in the queue to the post-mortem analysis of failed jobs:
- *Why is my job not starting?* now shows how to use the command `mysinfo` to check the current load on the cluster and its available nodes.
- *How can I monitor the status of my jobs?* now shows how to use the command `mysqueue` to check the status of your pending and running jobs in the queue, as well as how to open an interactive shell in your running jobs.
- *Why has my job crashed?* contains information to help you better understand why a job failed unexpectedly.
- *How can I check my resource usage?* now shows how to use the command `mysacct` to check the resources used by your past jobs and their efficiency.
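A typical monitoring session on the cluster might look as follows. These commands are run without options here; any additional flags they accept are not covered in this note.

```shell
# Before submitting: check the current load on the cluster
# and which nodes are available
mysinfo

# After submitting: check the status of your pending and
# running jobs in the queue
mysqueue

# After your jobs finish: check the resources used by past
# jobs and their efficiency
mysacct
```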
We have also expanded the section Job Checkpoints and Restarts with new information on how to restart jobs from a checkpoint with the tool DMTCP. This is especially useful for jobs with poor scaling that cannot avoid exceeding the time limit and that use software without built-in restart support. If you are in such a situation, you will find an example job script in Checkpoints with DMTCP that can easily be re-submitted to restart your job from its last saved checkpoint.
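The re-submittable pattern can be sketched roughly as follows. This is a minimal sketch, not the example script from the documentation: the module name, application binary and checkpoint interval are placeholders, while `dmtcp_launch -i` and the `dmtcp_restart_script.sh` generated by DMTCP are standard DMTCP features.

```shell
#!/bin/bash
# Hypothetical module name; check your cluster's module list
module load DMTCP

if [ -x ./dmtcp_restart_script.sh ]; then
    # A checkpoint exists from a previous run: resume from it.
    # DMTCP writes this restart script alongside its checkpoint images.
    ./dmtcp_restart_script.sh
else
    # First run: launch the application under DMTCP and write a
    # checkpoint every 3600 seconds (-i takes an interval in seconds)
    dmtcp_launch -i 3600 ./my_simulation
fi
```

On timeout, simply re-submitting the same script picks up the latest checkpoint instead of starting over.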
The final piece of new documentation concerns managing data files in your jobs. The new storage for VSC_HOME, VSC_DATA and VSC_DATA_VO is very reliable, but it is significantly slower than the scratch storage of VSC_SCRATCH and VSC_SCRATCH_VO.
Therefore, all jobs should access their data from the fast scratch to run with optimal performance. We added a new example job script in Data in jobs that achieves this goal by using a transient working directory in VSC_SCRATCH. This working directory is created on the fly with all the data files needed by the job, so all data reads and writes during job execution happen on the fast scratch. Only at the very end of the job are any selected output files saved back elsewhere.
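The stage-in/compute/stage-out flow can be sketched as below. This is a minimal sketch, not the documented example script: the computation is a placeholder, and the script falls back to temporary directories when the VSC_DATA and VSC_SCRATCH environment variables are not set, so it can be tried outside the cluster.

```shell
#!/bin/bash
# Fall back to temp dirs outside the cluster (assumption for this sketch)
DATA_DIR="${VSC_DATA:-$(mktemp -d)}"
SCRATCH_DIR="${VSC_SCRATCH:-$(mktemp -d)}"

# Placeholder input file standing in for your real data set
echo "input data" > "$DATA_DIR/input.dat"

# 1. Create a transient working directory on the fast scratch
WORKDIR="$SCRATCH_DIR/job_$$"
mkdir -p "$WORKDIR"

# 2. Stage in: copy the input files the job needs
cp "$DATA_DIR/input.dat" "$WORKDIR/"

# 3. Run the computation inside the scratch workdir, so all
#    reads/writes hit the fast storage (placeholder command)
cd "$WORKDIR"
tr a-z A-Z < input.dat > output.dat

# 4. Stage out: save only the selected output files back to
#    the reliable storage
cp output.dat "$DATA_DIR/"

# 5. Clean up the transient working directory
cd /
rm -rf "$WORKDIR"
```

Only `output.dat` is copied back at the end; intermediate files never leave the scratch.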