2. Troubleshooting

2.1. I have exceeded my $HOME disk quota, what now?

Your account will become unusable whenever your $HOME ($VSC_HOME) is full. In that case you have to delete some files from your home directory, such as temporary files, checkpoint files or any old unnecessary files until there is enough free space.
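
To find out what is taking up the space, you can list the largest files and directories in your home directory with standard tools. This is a generic example, not a Hydra-specific utility:

du -ah $VSC_HOME 2>/dev/null | sort -rh | head -n 20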

Keep in mind that your jobs should always run from $VSC_SCRATCH. This is the fastest storage and provides the largest space. Just beware that it is meant to hold any data files needed by your jobs to complete, but it does not have any backups. If you need more storage with stronger data protection and backups, you can copy the results of your jobs to $VSC_DATA.
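
For example, a job script can copy its results from the scratch working directory to $VSC_DATA once the calculation has finished. This is a minimal sketch; the directory names are only illustrative:

    #!/bin/bash
    #PBS -l nodes=1:ppn=1

    # run the calculation from the fast scratch storage
    cd $VSC_SCRATCH/my_project
    your_script_instructions
    # keep the results on storage with backups
    rsync -a results/ $VSC_DATA/my_project/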

Warning

Long-term archival of data in Hydra is forbidden. The HPC team is not responsible for data loss, although we will do everything we can to avoid it. Users should regularly backup important data outside of Hydra and clean-up their $HOME, $VSC_DATA and $VSC_SCRATCH.

2.2. I have accidentally deleted some data, how can I restore it?

If the deleted data was on your $VSC_SCRATCH, that data is permanently lost. The storage in $VSC_SCRATCH does not have any backups.

On the other hand, data in $HOME and $VSC_DATA is backed up and the backups are kept for 1 month. If you need a file or directory that you accidentally deleted or modified, please contact VUB-HPC Support.

Warning

Long-term archival of data in Hydra is forbidden. The HPC team is not responsible for data loss, although we will do everything we can to avoid it. Users should regularly backup important data outside of Hydra and clean-up their $HOME, $VSC_DATA and $VSC_SCRATCH.

2.3. Why is my job not starting?

It is expected that jobs stay some time in the queue: the HPC cluster is a shared system and distributing its resources efficiently is not trivial. If your job has been in the queue for less than 24h, that is probably normal; waiting more than 48h is very rare. Factors that can explain longer wait times are:

  • load of the cluster: e.g. weekends and holidays see less load

  • the number of jobs that you have submitted recently: we have fair share policies that will reduce the priority of users with many jobs in the queue

  • the requested resources of your job script: requesting many cores (>40) or GPUs (of which fewer are available) can lead to longer wait times

To get an overview of the available hardware and its current load, you can issue the command:

nodestat
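
To check the state of your own jobs in the queue, you can use the standard PBS/Torque command below, assuming the same PBS environment used by the job scripts in this guide:

qstat -u $USER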

Helpdesk

If it is not clear why your job is waiting in queue, do not cancel it. Contact us instead and we will analyse the situation of your job.

2.4. Why has my job crashed?

There are many possible causes for a job crash. Here are just a few of the more common ones:

Requested resources (memory or walltime) exceeded

Check the last few lines of your job output file. If the used and requested values of memory or walltime are very similar, you have probably exceeded your requested resources. Increase the corresponding resources in your job script and resubmit.
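
For example, in a PBS job script the walltime and memory requests can be raised with directives like the ones below; the values are only illustrative and should be adjusted to your case:

    #!/bin/bash
    #PBS -l walltime=48:00:00
    #PBS -l mem=16gb
    #PBS -l nodes=1:ppn=4

    your_script_instructions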

Wrong module(s) loaded

Check the syntax (case sensitive) and version of the modules. If you are using more than one module, check that they use a compatible toolchain. See The toolchain of software packages.
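
For instance, before submitting you can check which versions of a module are available and which modules are currently loaded; the module name below is just an example:

module av Python
module list

When combining modules, pick versions whose names carry the same toolchain suffix (e.g. the same GCC or foss version).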

Wrong input filetype / input read errors

Convert from DOS to UNIX text format with the command:

dos2unix <input_file>
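
If you are not sure whether a file has DOS line endings in the first place, the standard file utility will report CRLF line terminators for such files:

file <input_file>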

2.5. Jobs on the GPU nodes fail with “all CUDA-capable devices are busy or unavailable”

By default the GPU cards operate in process exclusive mode, meaning that only one process at a time can use the GPU. Hence, the GPU will appear as busy or unavailable to any process trying to use it after the first one. Calculations with multiple processes running on several cores can share a single GPU using one of the following methods:

  1. Recommended: launch the CUDA MPS daemon at the beginning of your job script and then just continue with your normal script instructions. It will automatically coordinate access to the GPU for the multiple processes:

     #!/bin/bash
     #PBS -l nodes=1:ppn=2:gpus=1

     # place the MPS communication pipes in the job's temporary directory
     export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps
     # start the CUDA MPS control daemon in the background
     nvidia-cuda-mps-control -d

     your_script_instructions

  2. Put the GPU in shared mode. An unlimited number of processes can then use the GPU:

     #!/bin/bash
     #PBS -l nodes=1:ppn=4:gpus=1:shared

     your_script_instructions
    

2.6. How can I run a job longer than the maximum allowed wall time?

The maximum wall time for calculations on Hydra is 5 days. If your job requires a longer wall time, there are a few options to consider.

  1. If the software supports multiprocessing and/or multithreading, you can request more CPU cores for your job. Consult the documentation of the software for parallel execution instructions. However, to prevent wasting computational resources, you have to make sure that adding more cores gives you the expected speedup. It is very important that you perform a parallel efficiency test. For more info on parallel efficiency, see the HPC training slides.

    Note

    It may be necessary to increase the requested RAM memory proportionally to the number of processes.

  2. It is common that scientific software provides methods to stop and restart simulations in a controlled manner. In that case, long calculations can be split into shorter restartable chunks, which can be submitted one after the other (see the sketch after this list for one way to chain them). Check the documentation of your software for restarting options.

  3. If the software does not support any restart method, you can use external checkpointing. See our section on Job Restart and Checkpointing for more details.
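
As an illustration of option 2, consecutive chunks can be chained with job dependencies so that each chunk only starts once the previous one has finished successfully. This is a minimal sketch assuming a PBS/Torque environment and a hypothetical job script chunk.pbs that restarts from the output of the previous chunk:

    # submit the first chunk and capture its job ID
    first=$(qsub chunk.pbs)
    # the second chunk only starts after the first completes successfully
    second=$(qsub -W depend=afterok:$first chunk.pbs)
    # further chunks can be chained in the same way
    qsub -W depend=afterok:$second chunk.pbs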