3. FAQ: Troubleshooting compute jobs#

3.1. I have exceeded my $HOME disk quota, what now?#

Your account will become unusable whenever your $HOME ($VSC_HOME) is full. In that case you have to delete files from your home directory, such as temporary files, checkpoint files or any other unneeded files, until there is enough free space.
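
A quick way to find out what is filling up your home directory is to list its largest files and folders. The snippet below is just a sketch using standard tools; adjust the number of results as needed:

Find the largest files and folders in your home directory#
du -ah $HOME 2>/dev/null | sort -rh | head -n 20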

Keep in mind that your jobs should always run from $VSC_SCRATCH. This is the fastest storage and provides the largest space. Beware, though, that it is only meant to hold the data files needed or generated by your jobs until their completion, and that it has no backups. If you need more reliable storage with stronger data protection and backups, you can copy the results of your jobs to $VSC_DATA.
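
As a sketch, copying the results of a finished job from scratch to the data storage can be done with a regular copy. The my_project paths below are just placeholders:

Copy job results from VSC_SCRATCH to VSC_DATA#
mkdir -p $VSC_DATA/my_project
cp -a $VSC_SCRATCH/my_project/results $VSC_DATA/my_project/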

See also

Our documentation on HPC Data Storage

3.2. I have accidentally deleted some data, how can I restore it?#

We keep regular data snapshots at multiple points in time for all our shared storage systems. This includes the storage of VSC_HOME, VSC_DATA and VSC_SCRATCH in your account as well as VSC_DATA_VO and VSC_SCRATCH_VO in your Virtual Organization.

See also

The documentation on HPC Data Storage has a detailed description of each storage in our clusters.

Snapshots of VSC_SCRATCH and VSC_SCRATCH_VO are limited to the last 7 days, while VSC_HOME, VSC_DATA and VSC_DATA_VO have snapshots going back several months. In both cases, snapshots for the last 7 days are taken daily. This means that any file lost in the past week can be recovered with at most 24 hours of missing changes.

  1. Locating the available snapshots

    Snapshots are stored in the parent folder of your storage partition. These folders are hidden and have different names depending on the storage system (replace YYYY, MM, DD with year, month, day, respectively; an example for VSC_SCRATCH is given after these steps):

    • VSC_HOME: $HOME/../.snapshot

    • VSC_DATA: $VSC_DATA/../.snapshot

    • VSC_DATA_VO: $VSC_DATA_VO/../.snapshot

    • VSC_SCRATCH: /rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/${USER:3:3}/$USER

    • VSC_SCRATCH_VO: /rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/vo/000/$VSC_VO

  2. Checking the available snapshots

    To show the available snapshots, just list the contents of the corresponding snapshot folder:

    Show snapshots of your home directory#
    ls $HOME/../.snapshot
    

    You will see a list of directories with names starting with SNAP_, followed by the date and time when the snapshot was taken. For example, SNAP_2022_05_11_111455 was taken on May 11th 2022 at 11:14:55.

  3. Restoring your files

    Once you find the lost file or folder inside one of the available snapshots, you can restore it by copying it to its original location:

    Example to restore file $HOME/myfile from snapshot SNAP_2022_05_11_111455#
    cp -a $HOME/../.snapshot/SNAP_2022_05_11_111455/$USER/myfile $HOME/myfile.recovered
    
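For VSC_SCRATCH and VSC_SCRATCH_VO, the date is part of the snapshot path itself. The command below is a sketch using 11 May 2022 as an example date; a snapshot for that day must still be available:

List your VSC_SCRATCH files from the snapshot of 11 May 2022#
ls /rhea/scratch/.snapshots/backup_snap_20220511/brussel/${USER:3:3}/$USER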

If you need help restoring your files, please contact VUB-HPC Support.

3.3. Why is my job not starting?#

It is expected that jobs stay in the queue for some time: the HPC cluster is a shared system and distributing its resources efficiently is not trivial. A wait of less than 24 hours is normal; waits longer than 48 hours are very rare. Factors that can explain longer wait times are:

  • load of the cluster: e.g. weekends and holidays see less load

  • the number of jobs that you have submitted recently: we have fair share policies that will reduce the priority of users with many jobs in the queue

  • the requested resources of your job script: requesting many cores (>40) or GPUs (which are fewer in number) can take longer to start

The command mysinfo shows an overview in real time of the available hardware resources for each partition in the cluster, including cores, memory and GPUs, as well as their current load and running state.

Example output of mysinfo#
 CLUSTER: hydra
 PARTITION       STATE [NODES x CPUS]   CPUS(A/I/O/T)     CPU_LOAD   MEMORY MB  GRES                GRES_USED
 ampere_gpu      resv  [    2 x 32  ]       0/64/0/64    0.01-0.03   246989 MB  gpu:a100:2(S:1)     gpu:a100:0(IDX:N/A)
 ampere_gpu      mix   [    3 x 32  ]      66/30/0/96  13.92-19.47   257567 MB  gpu:a100:2(S:0-1)   gpu:a100:2(IDX:0-1)
 ampere_gpu      alloc [    3 x 32  ]       96/0/0/96   3.27-32.00   257567 MB  gpu:a100:2(S:0-1)   gpu:a100:2(IDX:0-1)
 broadwell_himem idle  [    1 x 40  ]       0/40/0/40         0.07  1492173 MB  (null)              gpu:0
 [...]
 zen4            mix   [   13 x 64  ]   346/486/0/832   0.02-50.06   386510 MB  (null)              gpu:0
 zen4            alloc [    7 x 64  ]     448/0/0/448   2.03-74.64   386510 MB  (null)              gpu:0

Tip

The command mysinfo -N shows a detailed overview per node.

Helpdesk

If it is not clear why your job is waiting in the queue, do not cancel it. Contact us instead and we will analyse the situation of your job.

3.4. How can I monitor the status of my jobs?#

3.4.1. mysqueue#

The command mysqueue shows a detailed overview of your jobs currently in the queue, either PENDING to start or already RUNNING.

Example output of mysqueue#
  JOBID PARTITION   NAME           USER     STATE       TIME TIME_LIMIT NODES CPUS MIN_MEMORY NODELIST(REASON)
1125244 ampere_gpu  gpu_job01  vsc10000   RUNNING 3-01:55:38 5-00:00:00     1   16      7810M node404
1125245 ampere_gpu  gpu_job02  vsc10000   PENDING       0:00 5-00:00:00     1   16     10300M (Priority)
1125246 skylake     my_job01   vsc10000   RUNNING 2-19:58:16 4-23:59:00     2   40         8G node[310,320]
1125247 pascal_gpu  gpu_job03  vsc10000   PENDING       0:00 3-00:00:00     1   12       230G (Resources)

Each row in the table corresponds to one of your running or pending jobs, or to any individual running job in your Job arrays. You can check the PARTITION where each job is running or trying to start, and the resources (TIME, NODES, CPUS, MIN_MEMORY) that are or will be allocated to it.

Note

The command mysqueue -t all will show all your jobs in the last 24 hours.

The column NODELIST(REASON) will either show the list of nodes used by a running job or the reason behind the pending state of a job. The most common reason codes are the following:

Priority

Job is waiting for other pending jobs in front to be processed.

Resources

Job is in front of the queue but there are no available nodes with the requested resources.

ReqNodeNotAvail

The requested partition/nodes are not available. This usually happens during a scheduled maintenance.

See also

Full list of reason tags for pending jobs.

3.4.2. Attaching an interactive shell to a job#

Users can also inspect running jobs by attaching an interactive shell to them. This interactive shell runs on the same compute node and in the same environment as the target job, allowing you to monitor what is actually happening inside the job. For instance, you can check the running processes and memory consumption in real time with the command ps aux (see the example after the commands below).

Launch interactive shell attached to running job#
srun --jobid=<SLURM_JOBID> --pty bash

For MPI jobs and other tasks launched with srun, add option --overlap:

Launch interactive shell attached to running srun task#
srun --jobid=<SLURM_JOBID> --overlap --pty bash

For multi-node jobs, to get into a specific node, add option -w <node-name>. The list of nodes corresponding to a job can be obtained with mysqueue or slurm_jobinfo. For example, to launch an interactive shell on node345, do:

Launch interactive shell attached to running srun task on a specific node#
srun --jobid=<SLURM_JOBID> --overlap -w node345 --pty bash
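
Once the interactive shell is attached, standard tools can be used to inspect the job in real time. The command below is just a sketch; any process monitoring tool will do:

Show the processes of the job sorted by memory usage#
ps aux --sort=-%mem | head -n 15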

3.5. Why has my job crashed?#

There are many possible causes for a job crash. Here are just a few of the more common ones:

Requested resources (memory or time limit) exceeded

The command mysacct shows an overview of the resources used by your recent jobs. You can check whether the job FAILED or COMPLETED in the State column, whether the Elapsed time reached the Timelimit, and how much memory it used (MaxRSS). See also How can I check my resource usage? and the sketch at the end of this section.

Wrong module(s) loaded

Check the syntax (case sensitive) and version of the modules. If you are using more than one module, check that they use a compatible toolchain. See The toolchain of software packages.

Wrong input filetype / input read errors

Convert from DOS to UNIX text format with the command:

dos2unix <input_file>
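
Regarding the first cause above: if you prefer to query Slurm's accounting directly, the standard sacct command can show the same fields as mysacct. This is only a sketch with generic Slurm options; replace <SLURM_JOBID> with the ID of your job:

Check the state, run time and memory usage of a recent job#
sacct -j <SLURM_JOBID> --format=JobID,JobName,State,Elapsed,Timelimit,MaxRSS,ReqMem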

3.6. How can I run a job that takes longer than the time limit?#

The time limit for calculations on Hydra is 5 days. If your job requires a longer wall time, there are a few options to consider.

  1. Make sure to run your job on the fastest partition currently available. Check the VSC documentation on the hardware details of the Hydra partitions, and the docs on how to request a specific partition for your job.

  2. If the software supports multiprocessing and/or multithreading, you can request more CPU cores for your job. Consult the documentation of the software for parallel execution instructions. However, to prevent wasting computational resources, you have to make sure that adding more cores gives you the expected speedup. It is very important that you perform a parallel efficiency test. For more info on parallel efficiency, see the HPC training slides.

    Note

    It is recommended to increase the requested memory proportionally to the number of processes using --mem-per-cpu=X. If you don’t specify the requested memory, this is done automatically for you. A minimal example of such a resource request is shown after this list.

  3. Certain software packages have support for GPUs, giving a nice performance boost in some cases. Consult the documentation of the software for GPU support and how to run your calculations on a GPU, and the docs on how to submit a GPU job. Remember that software modules must be built with GPU support in order to actually use a GPU. Only modules with CUDA in their version string have GPU support. If there is no module with GPU support available, please contact VUB-HPC Support.

  4. It is common that scientific software provides methods to stop and restart simulations in a controlled manner. In such case, long calculations can be split into restartable shorter chunks, which can be submitted one after the other. Check the documentation of your software for restarting options.

  5. If the software does not support any restart method, you can use external checkpointing. See our section on Job Checkpoints and Restarts for more details.
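
As an illustration of point 2 above, below is a minimal sketch of a job script requesting several cores with memory scaled per core. The program name and the requested values are placeholders; adjust them based on your own parallel efficiency tests:

Example job script requesting multiple cores with proportional memory#
#!/bin/bash
#SBATCH --job-name=parallel_job
#SBATCH --time=4-00:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G

srun my_parallel_program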

3.7. My jobs run slower than expected, what can I do?#

The most common cause of low-performing jobs is mismanagement of computational resources. Some examples are:

  • Generating a lot more threads/processes than available cores

    Software that can parallelize over multiple cores runs optimally if the total number of active threads/processes is in line with the number of cores allocated to your job; as a rule of thumb, 1 process per core. However, by default, software might use the total number of cores in the node instead of the cores available to your job to determine how many threads/processes to generate. In other cases, executing parallel software from within scripts that are already parallelized can also lead to too many threads/processes. In both cases performance will degrade (see the sketch after this list).

  • Jobs with barely the minimum memory to work

    When the requested memory is barely enough for the job to run, applications might need to start swapping memory to disk to avoid an out-of-memory error. Accessing the disk is slow and should be avoided as much as possible. In such cases, increasing the requested memory to have a generous margin (typically ~20%) will allow the application to keep more data in memory, maintaining its efficiency and increasing performance.
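
For instance, for software parallelized with OpenMP, a common pattern is to derive the number of threads from the cores allocated by Slurm instead of relying on the software's defaults. The snippet below is a sketch; my_openmp_program is a placeholder and the exact variable to set depends on your software:

Match the number of threads to the cores allocated to the job#
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
my_openmp_program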

Please check the list of instructions for specific software and the recommendations to lower the running time of your jobs. Those recommendations can usually also be applied to jobs with degraded performance.

In very rare cases, your jobs might slow down due to jobs of other users misusing the resources of the cluster. If you suspect that this is the case, please contact VUB-HPC Support.