3. FAQ: Troubleshooting compute jobs#
3.1. I have exceeded my $HOME disk quota, what now?#
Your account will become unusable whenever your $HOME ($VSC_HOME) is full. In that case you have to delete some files from your home directory, such as temporary files, checkpoint files or any old unnecessary files, until there is enough free space.
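To find out what is taking up space, you can list the size of each directory in your home with standard tools. A minimal sketch (hidden directories are included as well):
du -h --max-depth=1 $HOME | sort -h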
Keep in mind that your jobs should always run from $VSC_SCRATCH. This is the fastest storage and provides the largest space. Just beware that it is meant to hold the data files needed or generated by your jobs until completion, but it does not have backups. If you need more reliable storage with stronger data protection and backups, you can copy the results of your jobs to $VSC_DATA.
See also
Our documentation on HPC Data Storage
3.2. I have accidentally deleted some data, how can I restore it?#
We keep regular data snapshots at multiple points in time for all our shared storage systems. This includes the storage of VSC_HOME, VSC_DATA and VSC_SCRATCH in your account, as well as VSC_DATA_VO and VSC_SCRATCH_VO in your Virtual Organization.
See also
The documentation on HPC Data Storage has a detailed description of each storage in our clusters.
Snapshots of VSC_SCRATCH and VSC_SCRATCH_VO are limited to the last 7 days, while VSC_HOME, VSC_DATA and VSC_DATA_VO have snapshots going back several months. In both cases, the snapshots of the last 7 days are kept daily. This means that any file lost in the past week can be recovered with at most 24 hours of changes missing.
Locating the available snapshots
Snapshots are stored in the parent folder of your storage partition. These folders are hidden and have different names depending on the storage system (replace YYYY, MM, DD with year, month, day, respectively):
VSC_HOME: $HOME/../.snapshot
VSC_DATA: $VSC_DATA/../.snapshot
VSC_DATA_VO: $VSC_DATA_VO/../.snapshot
VSC_SCRATCH: /rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/${USER:3:3}/$USER
VSC_SCRATCH_VO: /rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/vo/000/$VSC_VO
Checking the available snapshots
To show the available snapshots, just list the contents of the corresponding snapshot folder:
ls $HOME/../.snapshot
You will see a list of directories with names starting with SNAP_, followed by the date and time when the snapshot was taken. For example, SNAP_2022_05_11_111455 was taken on May 11th 2022 at 11:14:55.
Restoring your files
Once you find the lost file or folder inside one of the available snapshots, you can restore it by copying it to its original location:
cp -a $HOME/../.snapshot/SNAP_2022_05_11_111455/$USER/myfile $HOME/myfile.recovered
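If you are not sure which snapshot still contains the lost file, you can loop over the snapshot directories and test for it. This is just a sketch with placeholder names: myfile stands for your lost file, and the path follows the VSC_HOME layout described above.
for snap in $HOME/../.snapshot/SNAP_*; do
    [ -e "$snap/$USER/myfile" ] && echo "found in $snap"
done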
If you need help restoring your files, please contact VUB-HPC Support.
3.3. Why is my job not starting?#
It is expected that jobs will stay some time in the queue: the HPC cluster is a shared system and it is not trivial to distribute its resources efficiently. A wait of less than 24 hours is normal; a wait of more than 48 hours is very rare. Factors that can explain longer wait times are:
load of the cluster: e.g. weekends and holidays see less load
the number of jobs that you have submitted recently: we have fair share policies that will reduce the priority of users with many jobs in the queue
the requested resources of your job script: jobs requesting many cores (>40) or GPUs (of which fewer are available) can take longer to start
The command mysinfo shows an overview in real time of the available hardware resources for each partition in the cluster, including cores, memory and GPUs, as well as their current load and running state.
CLUSTER: hydra
PARTITION STATE [NODES x CPUS] CPUS(A/I/O/T) CPU_LOAD MEMORY MB GRES GRES_USED
ampere_gpu resv [ 2 x 32 ] 0/64/0/64 0.01-0.03 246989 MB gpu:a100:2(S:1) gpu:a100:0(IDX:N/A)
ampere_gpu mix [ 3 x 32 ] 66/30/0/96 13.92-19.47 257567 MB gpu:a100:2(S:0-1) gpu:a100:2(IDX:0-1)
ampere_gpu alloc [ 3 x 32 ] 96/0/0/96 3.27-32.00 257567 MB gpu:a100:2(S:0-1) gpu:a100:2(IDX:0-1)
broadwell_himem idle [ 1 x 40 ] 0/40/0/40 0.07 1492173 MB (null) gpu:0
[...]
zen4 mix [ 13 x 64 ] 346/486/0/832 0.02-50.06 386510 MB (null) gpu:0
zen4 alloc [ 7 x 64 ] 448/0/0/448 2.03-74.64 386510 MB (null) gpu:0
Tip
The command mysinfo -N shows a detailed overview per node.
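For an indicative estimate of when a pending job might start, you can also query the standard Slurm command squeue; the estimate is not guaranteed and may change as other jobs finish or are submitted:
squeue --start -j <SLURM_JOBID>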
Helpdesk
If it is not clear why your job is waiting in the queue, do not cancel it. Contact us instead and we will analyse the situation of your job.
3.4. How can I monitor the status of my jobs?#
3.4.1. mysqueue#
The command mysqueue shows a detailed overview of your jobs currently in the queue, either PENDING to start or already RUNNING.
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES CPUS MIN_MEMORY NODELIST(REASON)
1125244 ampere_gpu gpu_job01 vsc10000 RUNNING 3-01:55:38 5-00:00:00 1 16 7810M node404
1125245 ampere_gpu gpu_job02 vsc10000 PENDING 0:00 5-00:00:00 1 16 10300M (Priority)
1125246 skylake my_job01 vsc10000 RUNNING 2-19:58:16 4-23:59:00 2 40 8G node[310,320]
1125247 pascal_gpu gpu_job03 vsc10000 PENDING 0:00 3-00:00:00 1 12 230G (Resources)
Each row in the table corresponds to one of your running or pending jobs, or to an individual running job in your Job arrays. You can check the PARTITION where each job is running or trying to start and the resources (TIME, NODES, CPUS, MIN_MEMORY) that are or will be allocated to it.
Note
The command mysqueue -t all will show all your jobs in the last 24 hours.
The column NODELIST(REASON) will either show the list of nodes used by a running job or the reason behind the pending state of a job. The most common reason codes are the following:
- Priority
The job is waiting for other pending jobs ahead of it in the queue to be processed.
- Resources
The job is at the front of the queue, but no nodes with the requested resources are currently available.
- ReqNodeNotAvail
The requested partition/nodes are not available. This usually happens during a scheduled maintenance.
See also
Full list of reason tags for pending jobs.
3.4.2. Attaching an interactive shell to a job#
Users can also inspect running jobs by attaching an interactive shell to them. This interactive shell runs on the same compute node and in the same environment as the target job, allowing you to monitor what is actually happening inside the job. For instance, you can check the running processes and memory consumption in real time with the command ps aux.
srun --jobid=<SLURM_JOBID> --pty bash
For MPI jobs and other tasks launched with srun, add option --overlap:
srun --jobid=<SLURM_JOBID> --overlap --pty bash
For multi-node jobs, to get into a specific node, add option -w <node-name>. The list of nodes corresponding to a job can be obtained with mysqueue or slurm_jobinfo. For example, to launch an interactive shell on node345, do:
srun --jobid=<SLURM_JOBID> --overlap -w node345 --pty bash
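Once the shell is attached, you can inspect the job from inside the node. As a minimal sketch (what exactly to monitor depends on your job):
ps aux | grep $USER    # processes you are running on this node
top -u $USER           # live CPU and memory usage of your processes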
3.5. Why has my job crashed?#
There are many possible causes for a job crash. Here are just a few of the more common ones:
- Requested resources (memory or time limit) exceeded
The command mysacct shows an overview of the resources used by your recent jobs. You can check in the State column whether the job FAILED or COMPLETED, whether the Elapsed time reached the Timelimit, and how much memory it used (MaxRSS); a sketch of such a check is shown after this list. See also How can I check my resource usage?.
- Wrong module(s) loaded
Check the syntax (case sensitive) and version of the modules. If you are using more than one module, check that they use a compatible toolchain. See The toolchain of software packages.
- Wrong input filetype / input read errors
Convert from DOS to UNIX text format with the command:
dos2unix <input_file>
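As a sketch of the resource check described in the first item above, a single finished job can also be queried with the standard Slurm command sacct (the job ID is a placeholder):
sacct -j <SLURM_JOBID> --format=JobID,JobName,State,Elapsed,Timelimit,MaxRSS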
3.6. How can I run a job that takes longer than the time limit?#
The time limit for calculations on Hydra is 5 days. If your job requires a longer wall time, there are a few options to consider.
Make sure to run your job on the fastest partition currently available. Check the VSC docs on the hardware details of the Hydra partitions, and the docs on how to request a specific partition for your job.
If the software supports multiprocessing and/or multithreading, you can request more CPU cores for your job. Consult the documentation of the software for parallel execution instructions. However, to prevent wasting computational resources, you have to make sure that adding more cores gives you the expected speedup. It is very important that you perform a parallel efficiency test. For more info on parallel efficiency, see the HPC training slides.
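A parallel efficiency test can be as simple as running the same calculation with increasing core counts and comparing the elapsed times reported by mysacct. The sketch below is hypothetical: my_job.sh and the core counts are placeholders, and the job script should take its core count from the Slurm allocation (e.g. $SLURM_CPUS_PER_TASK).
# Submit the same job with 1, 2, 4, 8 and 16 cores, then compare elapsed times
for n in 1 2 4 8 16; do
    sbatch --cpus-per-task=$n --job-name=scaling_$n my_job.sh
done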
Note
It is recommended to increase the requested RAM memory proportionally to the number of processes using --mem-per-cpu=X. If you don't specify the requested memory, this is done automatically for you.
Certain software packages have support for GPUs, giving a nice performance boost in some cases. Consult the documentation of the software for GPU support and how to run your calculations on a GPU, and the docs on how to submit a GPU job. Remember that software modules must be built with GPU support in order to actually use a GPU. Only modules with CUDA in their version string have GPU support. If there is no module with GPU support available, please contact VUB-HPC Support.
It is common that scientific software provides methods to stop and restart simulations in a controlled manner. In that case, long calculations can be split into shorter restartable chunks, which can be submitted one after the other (a sketch of chaining such jobs is shown below). Check the documentation of your software for restarting options.
If the software does not support any restart method, you can use external checkpointing. See our section on Job Checkpoints and Restarts for more details.
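As an illustration of chaining restartable chunks, the sketch below submits each chunk only after the previous one completed successfully, using Slurm job dependencies. The script name chunk_job.sh and the number of chunks are placeholders, and your job script must itself write and read the restart files of your software.
# Submit 4 dependent chunks: each starts only if the previous one ended successfully
jobid=$(sbatch --parsable chunk_job.sh)
for i in 2 3 4; do
    jobid=$(sbatch --parsable --dependency=afterok:${jobid} chunk_job.sh)
done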
3.7. My jobs run slower than expected, what can I do?#
The most common cause of low-performing jobs is mismanagement of computational resources. Some examples are:
- Generating a lot more threads/processes than available cores
Software that can parallelize over multiple cores runs optimally if the total number of active threads/processes is in line with the number of cores allocated to your job; as a rule of thumb, 1 process per core. However, by default, software might use the total number of cores in the node instead of the cores available to your job to determine how many threads/processes to generate. In other instances, executing parallel software from within scripts that are already parallelized can also lead to too many threads/processes. In both cases performance will degrade (see the sketch after this list).
- Jobs with barely the minimum memory to work
If the requested memory is limited, but just large enough to guarantee proper execution, applications might need to start swapping memory to disk to avoid an out-of-memory error. Accessing the disk is slow and should be avoided as much as possible. In these cases, increasing the requested memory to have a generous margin (typically ~20%) will allow the application to keep more data in memory, maintaining the same efficiency while increasing performance.
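For software that honours the OMP_NUM_THREADS variable (e.g. OpenMP codes), a minimal sketch to match the number of threads to the allocated cores, assuming your job requests them with --cpus-per-task, is to set it in the job script:
# Use exactly the cores allocated by Slurm, not all cores of the node
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}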
Please check the list of instructions for specific software and the recommendations to lower the running time of your jobs. Those can usually be applied to jobs with degraded performance.
In very rare cases, your jobs might slow down due to jobs of other users misusing the resources of the cluster. If you suspect that this is the case, please contact VUB-HPC Support.