2. Troubleshooting#

2.1. Issues connecting to Hydra#

If you cannot connect after following the instructions in How to Connect, here is a list of things you can check:

  1. Changes in your VSC account can take up to an hour to become active in the HPC clusters

    This includes the approval of your account, newly uploaded public keys, as well as memberships in user groups and Virtual Organizations.

  2. Are you trying to connect from outside Belgium?

    Make sure to first register your IP in the HPC firewall, see the instructions in Connecting from outside Belgium.

  3. Check the location of your private SSH key

    Verify that your private key file is the one generated together with the public key that you uploaded to your VSC account page.

    If you can’t find the correct private key, you can create a new key pair and upload the new public key to your VSC account page. Please check the VSC documentation Reset your SSH keys if you need to start over.

    The location of your private key in your computer must be known to your SSH client:

    • Windows (PuTTY or MobaXterm): select the path to your private key in the SSH settings

    • Linux, macOS: the recommended path is ~/.ssh/id_rsa_vsc. Make sure to configure SSH to use the key in this path whenever you use your VSC-id by following the instructions in the VSC documentation on linking your key and VSC-id (an example entry is shown below).

      Alternatively, you can manually specify a custom location of your key to SSH with the following command:

      # replace <path_to_private_key> with the full path of your key
      # replace vsc1xxxx with your VSC account ID
      ssh -i <path_to_private_key> vsc1xxxx@login.hpc.vub.be
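
      For reference, linking your key and VSC-id typically boils down to an entry in your ~/.ssh/config file. A minimal sketch (the host name is the Hydra login server mentioned in this guide; replace vsc1xxxx with your VSC account ID and adapt the key path if yours differs):

      # example ~/.ssh/config entry (a sketch, adapt to your setup)
      Host login.hpc.vub.be
          User vsc1xxxx
          IdentityFile ~/.ssh/id_rsa_vsc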
      
  4. Check the file permissions of your private SSH key

    The file of your private key must only have read permission for your user (the owner) and no permissions for group or other.

    Command for Linux and macOS users to set the correct permissions#
      chmod 400 <path_to_private_key>
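
    To verify the result, list the file permissions of your key; it should show read access for the owner only (e.g. -r--------):

    Command to check the permissions of your private key#
      ls -l <path_to_private_key>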
    
  5. Check the format of your private SSH key

    • MobaXterm: both OpenSSH and PuTTY (.ppk) formats are supported.

    • PuTTY: only PuTTY format is supported.

    • Linux, macOS: only OpenSSH format is supported.

    On Windows computers, keys can be converted between PuTTY and OpenSSH formats with MobaXterm or PuTTYgen (bundled with PuTTY). Follow the instructions in the VSC documentation on converting to OpenSSH format.
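
    On Linux and macOS you can inspect the first line of the key file to identify its format: an OpenSSH key typically starts with "-----BEGIN OPENSSH PRIVATE KEY-----" (or "-----BEGIN RSA PRIVATE KEY-----" for older keys), while a PuTTY key starts with "PuTTY-User-Key-File":

    Show the first line of your private key#
      head -n 1 <path_to_private_key>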

  6. Check the passphrase of your SSH key

    Upon login you will be asked to enter the passphrase that you chose when generating your SSH key pair. If you forgot this passphrase, you will have to reset your keys following the instructions in the VSC documentation Reset your SSH keys. Note that it can take up to 1 hour before you can use your new SSH key. Note also that a passphrase is not the same as a password: if the system asks for your password, something else is wrong.

  7. Double check for spelling mistakes

    Typos often slip into the hostname of the login server (e.g. login.hpc.vub.be) or your VSC account ID (vsc1xxxx: ‘vsc’ followed by 5 digits).

  8. Did you previously connect to Hydra from a different computer than the one you are currently using?

    There are two solutions:

    • copy your existing private key to the new computer in the correct location

    • create a new SSH key pair and upload this new public key as your second key to your VSC account page
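
    For the first option, on Linux and macOS the key can be copied with scp, for example (run on the new computer; the address of the old computer is just a placeholder, and remember to restrict the permissions of the copied key):

      # copy the private key from the old computer and restrict its permissions
      scp <user>@<old_computer>:~/.ssh/id_rsa_vsc ~/.ssh/id_rsa_vsc
      chmod 400 ~/.ssh/id_rsa_vsc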

  9. Check software updates on your computer

    Software updates can be critical on Windows, as older versions of PuTTY or MobaXterm may not support some required (security-related) features.

Helpdesk If you still cannot connect after trying the above suggestions, we can assist you with the login process. Linux/macOS users, please provide the full output of the command:

# replace <path_to_private_key> with the full path of your key
# replace vsc1xxxx with your VSC account ID
ssh -vvv -i <path_to_private_key> vsc1xxxx@login.hpc.vub.be

2.2. I have exceeded my $HOME disk quota, what now?#

Your account will become unusable whenever your $HOME ($VSC_HOME) is full. In that case you have to delete some files from your home directory, such as temporary files, checkpoint files or any old unnecessary files until there is enough free space.
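
To find out what is taking up space, standard tools such as du can be used. For instance, the following lists the size of every item in your home directory (including hidden folders), largest last:

Find the largest items in your home directory#
du -h --max-depth=1 $HOME 2>/dev/null | sort -h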

Keep in mind that your jobs should always run from $VSC_SCRATCH. This is the fastest storage and provides the largest space. Just beware that it is meant to hold the data files needed or generated by your jobs until completion, but it does not have backups. If you need a more reliable storage with stronger data protection and backups, you can copy the results of your jobs to $VSC_DATA.

See also

Our documentation on Data Storage

2.3. I have accidentally deleted some data, how can I restore it?#

We keep regular data snapshots at multiple points in time for all our shared storage systems. This includes the storage of VSC_HOME, VSC_DATA and VSC_SCRATCH in your account as well as VSC_DATA_VO and VSC_SCRATCH_VO in your Virtual Organization.

See also

The documentation on Data Storage has a detailed description of each storage in our clusters.

Snapshots of VSC_SCRATCH and VSC_SCRATCH_VO are limited to the last 7 days, while VSC_HOME, VSC_DATA and VSC_DATA_VO have snapshots going back several months. In both cases, daily snapshots are kept for the last 7 days. This means that any file lost in the past week can be recovered with at most 24 hours of missing changes.

  1. Locating the available snapshots

    Snapshots are stored in the parent folder of your storage partition. These folders are hidden and have different names depending on the storage system (replace YYYY, MM, DD with year, month and day, respectively):

    • VSC_HOME: $HOME/../.snapshot

    • VSC_DATA: $VSC_DATA/../.snapshot

    • VSC_DATA_VO: $VSC_DATA_VO/../.snapshot

    • VSC_SCRATCH: /rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/${USER:3:3}/$USER

    • VSC_SCRATCH_VO: /rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/vo/000/$VSC_VO

  2. Checking the available snapshots

    To show the available snapshots, just list the contents of the corresponding snapshot folder:

    Show snapshots of your home directory#
    ls $HOME/../.snapshot
    

    You will see a list of directories with names starting with SNAP_, followed by the date and time when the snapshot was taken. For example, SNAP_2022_05_11_111455 was taken on May 11th 2022 at 11:14:55.
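
    For VSC_SCRATCH and VSC_SCRATCH_VO, the snapshot directories are instead named after the date they were taken (backup_snap_YYYYMMDD) and can be listed directly from the paths given above:

    Show snapshots of the scratch storage#
    ls -d /rhea/scratch/.snapshots/backup_snap_*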

  3. Restoring your files

    Once you find the lost file or folder inside one of the available snapshots, you can restore it by copying it to its original location:

    Example to restore file $HOME/myfile from snapshot SNAP_2022_05_11_111455#
    cp -a $HOME/../.snapshot/SNAP_2022_05_11_111455/$USER/myfile $HOME/myfile.recovered
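
    Restoring from a scratch snapshot works in the same way, using the snapshot paths listed above (replace YYYYMMDD with the date of the snapshot):

    Example to restore file $VSC_SCRATCH/myfile from a scratch snapshot#
    cp -a /rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/${USER:3:3}/$USER/myfile $VSC_SCRATCH/myfile.recovered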
    

If you need help restoring your files, please contact VUB-HPC Support.

2.4. Why is my job not starting?#

It is expected that jobs stay some time in the queue: the HPC cluster is a shared system and distributing its resources efficiently is not trivial. A wait of less than 24 hours is perfectly normal; a wait of more than 48 hours is very rare. Factors that can explain longer wait times are:

  • load of the cluster: e.g. weekends and holidays see less load

  • the number of jobs that you have submitted recently: we have fair share policies that will reduce the priority of users with many jobs in the queue

  • the requested resources of your job script: jobs requesting many cores (>40) or GPUs (which are fewer in number) can wait longer
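
If you want to see how these factors translate into the priority of your job, the standard Slurm command sprio (a generic Slurm tool, mentioned here as a suggestion) displays the priority components of pending jobs:

Show the priority factors of a pending job#
# replace <SLURM_JOBID> with the ID of your pending job
sprio -j <SLURM_JOBID>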

The command mysinfo shows an overview in real time of the available hardware resources for each partition in the cluster, including cores, memory and GPUs, as well as their current load and running state.

Example output of mysinfo#
 CLUSTER: hydra
 PARTITION       STATE [NODES x CPUS]   CPUS(A/I/O/T)     CPU_LOAD   MEMORY MB  GRES                GRES_USED
 ampere_gpu      resv  [    2 x 32  ]       0/64/0/64    0.01-0.03   246989 MB  gpu:a100:2(S:1)     gpu:a100:0(IDX:N/A)
 ampere_gpu      mix   [    3 x 32  ]      66/30/0/96  13.92-19.47   257567 MB  gpu:a100:2(S:0-1)   gpu:a100:2(IDX:0-1)
 ampere_gpu      alloc [    3 x 32  ]       96/0/0/96   3.27-32.00   257567 MB  gpu:a100:2(S:0-1)   gpu:a100:2(IDX:0-1)
 broadwell       mix   [   12 x 28  ]   231/105/0/336   0.01-48.82   257703 MB  (null)              gpu:0
 broadwell       alloc [   15 x 28  ]     420/0/0/420   3.55-19.22   257703 MB  (null)              gpu:0
 broadwell_himem idle  [    1 x 40  ]       0/40/0/40         0.07  1492173 MB  (null)              gpu:0

Tip

The command mysinfo -N shows a detailed overview per node.

Helpdesk If it is not clear why your job is waiting in queue, do not cancel it. Contact us instead and we will analyse the situation of your job.

2.5. How can I monitor the status of my jobs?#

2.5.1. mysqueue#

The command mysqueue shows a detailed overview of your jobs currently in the queue, either PENDING to start or already RUNNING.

Example output of mysqueue#
  JOBID PARTITION   NAME           USER     STATE       TIME TIME_LIMIT NODES CPUS MIN_MEMORY NODELIST(REASON)
1125244 ampere_gpu  gpu_job01  vsc10000   RUNNING 3-01:55:38 5-00:00:00     1   16      7810M node404
1125245 ampere_gpu  gpu_job02  vsc10000   PENDING       0:00 5-00:00:00     1   16     10300M (Priority)
1125246 skylake     my_job01   vsc10000   RUNNING 2-19:58:16 4-23:59:00     2   40         8G node[310,320]
1125247 pascal_gpu  gpu_job03  vsc10000   PENDING       0:00 3-00:00:00     1   12       230G (Resources)

Each row in the table corresponds to one of your running or pending jobs or any individual running job in your Job arrays. You can check the PARTITION where each job is running or trying to start and the resources (TIME, NODES, CPUS, MIN_MEMORY) that are/will be allocated to it.

Note

The command mysqueue -t all will show all your jobs in the last 24 hours.

The column NODELIST(REASON) will either show the list of nodes used by a running job or the reason behind the pending state of a job. The most common reason codes are the following:

Priority

Job is waiting for other pending jobs in front to be processed.

Resources

Job is in front of the queue but there are no available nodes with the requested resources.

ReqNodeNotAvail

The requested partition/nodes are not available. This usually happens during a scheduled maintenance.

See also

Full list of reason tags for pending jobs.

2.5.2. Attaching an interactive shell to a job#

Users can also inspect running jobs by attaching an interactive shell to them. This interactive shell runs on the same compute node and in the same environment as the target job, allowing you to monitor what is actually happening inside it. For instance, you can check the running processes and memory consumption in real time with the command ps aux.

Launch interactive shell attached to running job#
srun --jobid=<SLURM_JOBID> --pty bash

For MPI jobs and other tasks launched with srun, add option --overlap:

Launch interactive shell attached to running srun task#
srun --jobid=<SLURM_JOBID> --overlap --pty bash

For multi-node jobs, to get into a specific node, add option -w <node-name>. The list of nodes corresponding to a job can be obtained with mysqueue or slurm_jobinfo. For example, to launch an interactive shell on node345, do:

Launch interactive shell attached to running srun task on a specific node#
srun --jobid=<SLURM_JOBID> --overlap -w node345 --pty bash
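
Once attached, you can look around as in any other shell session. For example, to check the processes of the job and their memory consumption (sorted by memory usage, highest first):

Inspect processes and memory usage inside the job#
# show the most memory-hungry processes on the node
ps aux --sort=-%mem | head -n 15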

2.6. Why has my job crashed?#

There are many possible causes for a job crash. Here are just a few of the more common ones:

Requested resources (memory or time limit) exceeded

The command mysacct shows an overview of the resources used by your recent jobs. You can check if the job FAILED or COMPLETED in the State column, if the Elapsed time reached the Timelimit, or how much memory it used (MaxRSS). See also How can I check my resource usage?

Wrong module(s) loaded

Check the syntax (case sensitive) and version of the modules. If you are using more than one module, check that they use a compatible toolchain. See The toolchain of software packages.
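
To review which modules are loaded and which versions are available, you can use the standard commands of the module system:

Inspect loaded and available modules#
# show the modules currently loaded in your environment
module list
# list the available versions of a given package (replace <name>)
module avail <name>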

Wrong input filetype / input read errors

Convert from DOS to UNIX text format with the command:

dos2unix <input_file>
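
You can check whether a file still has DOS line endings with the file utility, which typically reports "with CRLF line terminators" for such files:

file <input_file>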

2.7. How can I run a job longer than the maximum allowed wall time?#

The maximum wall time for calculations on Hydra is 5 days. If your job requires a longer wall time, there are a few options to consider.

  1. If the software supports multiprocessing and/or multithreading, you can request more CPU cores for your job. Consult the documentation of the software for parallel execution instructions. However, to prevent wasting computational resources, you have to make sure that adding more cores gives you the expected speedup. It is very important that you perform a parallel efficiency test. For more info on parallel efficiency, see the HPC training slides.

    Note

    It is recommended to increase the requested memory proportionally to the number of processes by using --mem-per-cpu=X. If you don’t explicitly request memory, this is done automatically for you.
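
    As an illustration, a job script for a multithreaded application could request its cores and memory with directives like the following (the values are only an example; adapt them to the results of your own efficiency tests):

      #SBATCH --ntasks=1
      #SBATCH --cpus-per-task=8
      #SBATCH --mem-per-cpu=2G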

  2. It is common that scientific software provides methods to stop and restart simulations in a controlled manner. In that case, long calculations can be split into restartable shorter chunks, which can be submitted one after the other. Check the documentation of your software for restarting options.

  3. If the software does not support any restart method, you can use external checkpointing. See our section on Job Checkpoints and Restarts for more details.

2.8. My jobs run slower than expected, what can I do?#

The most common cause of low-performing jobs is mismanagement of computational resources. Some examples are:

  • Generating a lot more threads/processes than available cores

    Software that can parallelize over multiple cores runs optimally when the total number of active threads/processes is in line with the number of cores allocated to your job; as a rule of thumb, 1 process per core. However, by default, software might use the total number of cores in the node, instead of the cores available to your job, to determine how many threads/processes to generate. In other cases, executing parallel software from within scripts that are already parallelized can also lead to too many threads/processes. In both cases performance will degrade (see the sketch after this list).

  • Jobs with barely the minimum memory to work

    If the requested memory is barely enough to guarantee proper execution, applications might start swapping memory to disk to avoid an out-of-memory error. Accessing the disk is slow and should be avoided as much as possible. In these cases, requesting memory with a generous margin (typically ~20% extra) allows the application to keep more data in memory, maintaining efficiency and increasing performance.
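
For the first case above, many multithreaded applications honour the OMP_NUM_THREADS environment variable (or an equivalent option of your software). A minimal sketch to align the number of threads with the cores allocated by Slurm in your job script:

Limit the number of threads to the allocated cores#
# use the number of cores allocated to the job (fall back to 1 if not set)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}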

Please check the list of instructions for specific software and the recommendations to lower the running time of your jobs; these usually apply to jobs with degraded performance as well.

In very rare cases, your jobs might slow down due to jobs of other users misusing the resources of the cluster. If you suspect that this is the case, please contact VUB-HPC Support.