2. Troubleshooting#

2.1. Issues connecting to Hydra#

If you cannot connect after following the instructions in How to Connect, here is a list of things you can check:

  1. Keep in mind that it can take up to an hour for your VSC account to become active after it has been approved; until then, connecting to Hydra will not work.

  2. Are you trying to connect from outside Belgium?

    Make sure to register your IP first, see the instructions in Connecting from outside Belgium.

  3. Check the location of your private SSH key

    It must be the private key from the same key pair as the public key that you uploaded to your VSC account page. If you can’t find the correct private key, you can create a new key pair and upload the new public key to your VSC account page. Please check VSC Docs: Reset your SSH keys if you need to start over.

    The private key must be located in the right place on your computer:

    • Windows (PuTTY or MobaXterm): select the path to your private key in the SSH settings

    • Linux, macOS: the default path is ~/.ssh/id_rsa. You can also specify a custom path with the following command:

      # replace <path_to_private_key> with the full path of your key
      # replace vsc1xxxx with your VSC account ID
      ssh -i <path_to_private_key> vsc1xxxx@login.hpc.vub.be
      
  4. Check the file permissions of your private SSH key

    The key must only have read permission for your user (the owner) and no permissions for group or other.

    • Linux, macOS users can use the following command to set the permissions right:

      chmod 400 <path_to_private_key>
      
  5. Check the format of your private SSH key:

    • MobaXterm: both OpenSSH and PuTTY (.ppk) formats are supported.

    • PuTTY: only PuTTY format is supported.

    • Linux, macOS: only OpenSSH format is supported.

    To convert your key between PuTTY and OpenSSH formats, you can use MobaXterm or PuTTYgen (bundled with PuTTY). Follow the instructions in VSC Docs: convert to OpenSSH format.

  6. Check the passphrase of your SSH key:

    Upon login you will be asked to enter the passphrase that you chose when generating your SSH key pair. If you forgot this passphrase, you will have to reset your keys following the instructions in VSC Docs: Reset your SSH keys. Note that it can take up to 1 hour before you will be able to use your new SSH key. Note also that a passphrase is not the same as a password: if the system asks for your password, something else is wrong.

  7. Double-check the hostname (login.hpc.vub.be) and your VSC account ID (vsc1xxxx: ‘vsc’ followed by 5 digits) for spelling mistakes.

  8. Did you previously connect to Hydra from a different computer than the one you are currently using?

    Copy your private key to the correct location on the new computer, or create a second SSH key pair (and upload the second public key to your VSC account page).

  9. Windows (PuTTY or MobaXterm): make sure you are using the latest PuTTY/MobaXterm version. Older versions may not support some required (security-related) features.

Helpdesk If you still cannot connect after trying the above suggestions, we can assist you with the login process. Linux/macOS users, please provide the full output of the command:

# replace <path_to_private_key> with the full path of your key
# replace vsc1xxxx with your VSC account ID
ssh -vvv -i <path_to_private_key> vsc1xxxx@login.hpc.vub.be

2.2. I have exceeded my $HOME disk quota, what now?#

Your account will become unusable whenever your $HOME ($VSC_HOME) is full. In that case you have to delete some files from your home directory, such as temporary files, checkpoint files or any old unnecessary files until there is enough free space.

Keep in mind that your jobs should always run from $VSC_SCRATCH. This is the fastest storage and provides the largest space. Just beware that it is meant to hold the data files needed or generated by your jobs until completion, but it does not have backups. If you need more reliable storage with stronger data protection and backups, you can copy the results of your jobs to $VSC_DATA.
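
To find out what is taking up space in your home directory, standard tools are enough; here is a minimal sketch that lists the largest entries first:

# list the contents of your home directory sorted by size, largest first
du -ah --max-depth=1 $HOME | sort -rh | head -n 20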

See also

Our documentation on Data Storage

2.3. I have accidentally deleted some data, how can I restore it?#

If the deleted data was in your $VSC_SCRATCH, that data is permanently lost. The storage in $VSC_SCRATCH does not have any backups.

On the other hand, data in $HOME and $VSC_DATA has daily snapshots, which are kept for at least 7 days.

  1. Checking the available snapshots

    To show the available snapshots, type the following commands:

    # show $HOME snapshots
    ls $HOME/../.snapshot
    
    # show $VSC_DATA snapshots
    ls $VSC_DATA/../.snapshot
    

    You will see a list of directories with names starting with SNAP_, followed by the date and time when the snapshot was taken. For example, SNAP_2022_05_11_111455 was taken on May 11th 2022 at 11:14:55.

  2. Restoring your files

    The following command demonstrates how you can restore file $HOME/myfile:

    # replace vsc1xxxx with your VSC account ID
    # replace X in SNAP_X with the date and time of the snapshot that you need
    cp -a $HOME/../.snapshot/SNAP_X/vsc1xxxx/myfile $HOME/myfile.SNAP_X
    

If you need help restoring your files, please contact VUB-HPC Support.

See also

Our documentation on Data Storage

2.4. Why is my job not starting?#

It is expected that jobs spend some time in the queue: the HPC cluster is a shared system and distributing its resources efficiently is not trivial. A wait of less than 24 hours is normal, while a wait of more than 48 hours is very rare. Factors that can explain longer wait times are:

  • load of the cluster: e.g. weekends and holidays see less load

  • the number of jobs that you have submitted recently: we have fair share policies that will reduce the priority of users with many jobs in the queue

  • the requested resources of your job script: requesting many cores (>40) or GPUs (which are fewer in number) can lead to longer wait times

To get an overview of the available hardware and their current load, you can issue the command:

mysinfo
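
Assuming the standard Slurm client commands are available on the login node (mysinfo suggests a Slurm-based scheduler), you can also inspect your own queued jobs and the reason reported by the scheduler; this is a sketch and the exact output may differ:

# list your queued and running jobs; the NODELIST(REASON) column shows why a job is still pending
squeue -u $USER

# show the scheduler's estimated start time for your pending jobs (may be empty)
squeue --start -u $USER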

Helpdesk If it is not clear why your job is waiting in the queue, do not cancel it. Contact us instead and we will analyse your job’s situation.

2.5. Why has my job crashed?#

There are many possible causes for a job crash. Here are just a few of the more common ones:

Requested resources (memory or walltime) exceeded

Check the last few lines of your job output file. If the used and requested values of memory or walltime are very similar, you have probably exceeded your requested resources. Increase the corresponding resources in your job script and resubmit.
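
If the job output file is inconclusive, and assuming Slurm accounting is available on the cluster, you can compare the used and requested resources of a finished job with sacct; the job ID below is a placeholder:

# replace <job_id> with the ID of the crashed job
sacct -j <job_id> --format=JobID,State,Elapsed,Timelimit,MaxRSS,ReqMem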

Wrong module(s) loaded

Check the name (module names are case sensitive) and version of the modules. If you are using more than one module, check that they use a compatible toolchain. See The toolchain of software packages.
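
Assuming the cluster provides the Lmod module system, the following commands help to verify module names, versions and toolchains; the software name is a placeholder:

# show the modules currently loaded in your session
module list

# search all available versions of a package, including their toolchains
module spider <software_name>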

Wrong input filetype / input read errors

Input files created or edited on Windows may be in DOS text format, which can cause read errors. Convert them from DOS to UNIX text format with the command:

dos2unix <input_file>
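
If you are not sure whether an input file is in DOS format, you can check it first with the standard file utility before converting anything:

# a file in DOS format is reported with "CRLF line terminators"
file <input_file>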

2.6. How can I run a job longer than the maximum allowed wall time?#

The maximum wall time for calculations on Hydra is 5 days. If your job requires a longer wall time, there are a few options to consider.

  1. If the software supports multiprocessing and/or multithreading, you can request more CPU cores for your job. Consult the documentation of the software for parallel execution instructions. However, to prevent wasting computational resources, you have to make sure that adding more cores gives you the expected speedup. It is very important that you perform a parallel efficiency test. For more info on parallel efficiency, see the HPC training slides.

    Note

    It is recommended to increase the requested memory proportionally to the number of processes using --mem-per-cpu=X. If you don’t specify the requested memory, this is done automatically for you. A sketch of such a resource request is shown after this list.

  2. It is common for scientific software to provide methods to stop and restart simulations in a controlled manner. In that case, long calculations can be split into shorter restartable chunks, which can be submitted one after the other. Check the documentation of your software for restart options.

  3. If the software does not support any restart method, you can use external checkpointing. See our section on Restart and Checkpointing for more details.
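
As mentioned in the note of option 1, this is a minimal sketch of a Slurm resource request for a multithreaded job; all values are placeholders and should be chosen based on your parallel efficiency test:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # cores for the multithreaded application
#SBATCH --mem-per-cpu=2G       # memory scales with the number of requested cores
#SBATCH --time=5-00:00:00      # maximum wall time on Hydra is 5 days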

2.7. My jobs run slower than expected, what can I do?#

The most common cause of low-performing jobs is mismanagement of computational resources. Some examples are:

  • Generating a lot more threads/processes than available cores

    Software that can parallelize over multiple cores runs optimally if the total number of active threads/processes is in line with the number of cores allocated to your job; as a rule of thumb, 1 process per core. However, by default, software might use the total number of cores in the node, instead of the cores available to your job, to determine how many threads/processes to spawn. In other cases, executing parallel software from within scripts that are already parallelized can also lead to too many threads/processes. In both cases performance will degrade (see the sketch after this list for a common way to keep the thread count in check).

  • Jobs with barely the minimum memory to work

    If the requested memory is just barely enough to guarantee proper execution, applications might start swapping memory to disk to avoid an out-of-memory error. Accessing the disk is slow and should be avoided as much as possible. In these cases, increasing the requested memory with a generous margin (typically ~20%) allows the application to keep more data in memory, maintaining the same efficiency while improving performance.
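
For software that follows the OpenMP convention, a common way to keep the number of threads in line with the allocated cores is shown below; this is a sketch that assumes your job requests cores with --cpus-per-task, and your application may use a different mechanism or command line option:

# inside your job script: limit the OpenMP threads to the cores allocated by Slurm
# (SLURM_CPUS_PER_TASK is set when the job requests --cpus-per-task)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK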

Please check the list of instructions for specific software and the recommendations to lower the walltime needed by your jobs. Those can usually be applied to jobs with degraded performance.

In very rare cases, your jobs might slow down due to jobs of other users misusing the resources of the cluster. If you suspect that this is the case, please contact VUB-HPC Support.