2. Troubleshooting#
2.1. Issues connecting to Hydra#
If you cannot connect after following the instructions in How to Connect, here is a list of things you can check:
Changes in your VSC account can take up to an hour to become active in the HPC clusters
This includes the approval of your account, newly uploaded public keys, as well as memberships to user groups and Virtual Organizations.
Are you trying to connect from outside Belgium?
Make sure to first register your IP in the HPC firewall, see the instructions in Connecting from outside Belgium.
Check the location of your private SSH key
Verify that the file of your private key corresponds to the same key that was generated together with the public key that you uploaded to your VSC account page.
If you can’t find the correct private key, you can create a new key pair and upload the new public key to your VSC account page. Please check Reset your SSH keys in the VSC docs if you need to start over.
The location of your private key in your computer must be known to your SSH client:
Windows (PuTTY or MobaXterm): select the path to your private key in the SSH settings
Linux, macOS: the recommended path is ~/.ssh/id_rsa_vsc. Make sure to configure SSH to use the key in this path whenever you use your VSC-id by following the instructions in link your key and VSC-id in the VSC docs. Alternatively, you can manually specify a custom location of your key to SSH with the following command:
# replace <path_to_private_key> with the full path of your key
# replace vsc1xxxx with your VSC account ID
ssh -i <path_to_private_key> vsc1xxxx@login.hpc.vub.be
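If you prefer a persistent setup, the key location and your VSC-id can also be stored in the OpenSSH client configuration. The entry below is only a minimal illustration, assuming the default OpenSSH client on Linux/macOS; adapt the key path and account ID to your own:
# example entry in ~/.ssh/config (adapt the key path and VSC account ID)
Host login.hpc.vub.be
    User vsc1xxxx
    IdentityFile ~/.ssh/id_rsa_vsc
With this entry in place, a plain ssh login.hpc.vub.be picks up the right key and account automatically.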
Check the file permissions of your private SSH key
The file of your private key must only have read permission for your user (the owner) and no permissions for group or other.
chmod 400 <path_to_private_key>
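To verify the result, you can list the file permissions; after the command above the listing should start with -r-------- (read-only for the owner):
ls -l <path_to_private_key>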
Check the format of your private SSH key
MobaXterm: both OpenSSH and PuTTY (.ppk) formats are supported.
PuTTY: only PuTTY format is supported.
Linux, macOS: only OpenSSH format is supported.
On Windows computers, keys can be converted between PuTTY and OpenSSH formats with MobaXterm or PuTTYgen (bundled with PuTTY). Follow the instructions in convert to OpenSSH format in the VSC docs.
Check the passphrase of your SSH key
Upon login you will be asked to re-enter the passphrase that you entered during generation of your SSH key pair. If you forgot this passphrase, you will have to reset your keys following the instructions in Reset your SSH keys in the VSC docs. Note that it can take up to 1 hour before you will be able to use your new SSH key. Note also that the passphrase is not the same as your password: if the system asks for your password, something else is wrong.
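If you are unsure whether you still remember the passphrase, a standard OpenSSH check is to ask ssh-keygen to print the public key of your private key file; it will prompt for the passphrase and fail if it is wrong (the key path below is an example):
# prompts for the passphrase; prints the public key on success
ssh-keygen -y -f ~/.ssh/id_rsa_vsc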
Double check for spelling mistakes
Common mistakes are the hostname of the login server (e.g. login.hpc.vub.be) or your VSC account ID (vsc1xxxx: ‘vsc’ followed by 5 digits).
You previously connected to Hydra from a different computer than the one you are currently using?
There are two solutions:
copy your existing private key to the new computer in the correct location (a sketch is shown below)
create a new SSH key pair and upload this new public key as your second key to your VSC account page
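For the first solution, the key can be copied from the old computer with any secure transfer tool. As a sketch on Linux/macOS, with placeholder host and user names, and restoring the strict permissions afterwards:
# run on the new computer; <user> and <old_computer> are placeholders
scp <user>@<old_computer>:~/.ssh/id_rsa_vsc ~/.ssh/id_rsa_vsc
chmod 400 ~/.ssh/id_rsa_vsc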
Check software updates on your computer
Software updates can be critical on Windows, as older versions of PuTTY or MobaXterm may not support some required (security-related) features.
Helpdesk
If you still cannot connect with the above suggestions, we can assist you in the login process. Linux/macOS users, please provide the full output of the command:
# replace <path_to_private_key> with the full path of your key
# replace vsc1xxxx with your VSC account ID
ssh -vvv -i <path_to_private_key> vsc1xxxx@login.hpc.vub.be
2.2. I have exceeded my $HOME disk quota, what now?#
Your account will become unusable whenever your $HOME ($VSC_HOME) is full. In that case you have to delete some files from your home directory, such as temporary files, checkpoint files or any old unnecessary files until there is enough free space.
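A quick way to find what is taking up space, assuming the standard GNU tools available on a Linux system, is to list the size of everything in your home directory sorted by size:
# show the size of all regular and hidden items in $HOME, largest last
du -sh $HOME/* $HOME/.[!.]* 2>/dev/null | sort -h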
Keep in mind that your jobs should always run from $VSC_SCRATCH. This is the fastest storage and provides the largest space. Just beware that it is meant to hold the data files needed or generated by your jobs until completion, but it does not have backups. If you need a more reliable storage with stronger data protection and backups, you can copy the results of your jobs to $VSC_DATA.
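For example, a finished result folder on scratch can be copied to the data storage (the folder name is just a placeholder):
# archive a results folder from scratch to the backed-up data storage
cp -a $VSC_SCRATCH/my_project_results $VSC_DATA/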
See also
Our documentation on HPC Data Storage
2.3. I have accidentally deleted some data, how can I restore it?#
We keep regular data snapshots at multiple points in time for all our shared storage systems. This includes the storage of VSC_HOME, VSC_DATA and VSC_SCRATCH in your account as well as VSC_DATA_VO and VSC_SCRATCH_VO in your Virtual Organization.
See also
The documentation on HPC Data Storage has a detailed description of each storage in our clusters.
Snapshots of VSC_SCRATCH and VSC_SCRATCH_VO are limited to the last 7 days, while VSC_HOME, VSC_DATA and VSC_DATA_VO have snapshots going back several months. In both cases, the snapshots of the last 7 days are kept daily. This means that any file lost in the past week can be recovered with at most 24 hours of missing changes.
Locating the available snapshots
Snapshots are stored in the parent folder of your storage partition. These folders are hidden and have different names depending on the storage system (replace YYYY, MM, DD with year, month, day, respectively):
VSC_HOME:
$HOME/../.snapshot
VSC_DATA:
$VSC_DATA/../.snapshot
VSC_DATA_VO:
$VSC_DATA_VO/../.snapshot
VSC_SCRATCH:
/rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/${USER:3:3}/$USER
VSC_SCRATCH_VO:
/rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/vo/000/$VSC_VO
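For example, to see which of your scratch files are present in a given daily snapshot, list your personal directory inside it (replace YYYYMMDD with the date of the snapshot, following the pattern above):
ls /rhea/scratch/.snapshots/backup_snap_YYYYMMDD/brussel/${USER:3:3}/$USER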
Checking the available snapshots
To show the available snapshots, just list the contents of the corresponding snapshot folder:
ls $HOME/../.snapshot
You will see a list of directories with names starting with SNAP_, followed by the date and time when the snapshot was taken. For example, SNAP_2022_05_11_111455 was taken on May 11th 2022 at 11:14:55.
Restoring your files
Once you find the lost file or folder inside one of the available snapshots, you can restore it by copying it to its original location:
cp -a $HOME/../.snapshot/SNAP_2022_05_11_111455/$USER/myfile $HOME/myfile.recovered
If you need help restoring your files, please contact VUB-HPC Support.
2.4. Why is my job not starting?#
It is expected that jobs stay some time in the queue: the HPC cluster is a shared system and it is not trivial to distribute its resources efficiently. A job waiting in the queue for less than 24h is normal; waiting for more than 48h is rare. Factors that can explain longer wait times are:
load of the cluster: e.g. weekends and holidays see less load
the number of jobs that you have submitted recently: we have fair share policies that will reduce the priority of users with many jobs in the queue
the requested resources of your job script: requesting many cores (>40) or GPUs (less available) can take longer
The command mysinfo shows a real-time overview of the available hardware resources for each partition in the cluster, including cores, memory and GPUs, as well as their current load and running state.
CLUSTER: hydra
PARTITION STATE [NODES x CPUS] CPUS(A/I/O/T) CPU_LOAD MEMORY MB GRES GRES_USED
ampere_gpu resv [ 2 x 32 ] 0/64/0/64 0.01-0.03 246989 MB gpu:a100:2(S:1) gpu:a100:0(IDX:N/A)
ampere_gpu mix [ 3 x 32 ] 66/30/0/96 13.92-19.47 257567 MB gpu:a100:2(S:0-1) gpu:a100:2(IDX:0-1)
ampere_gpu alloc [ 3 x 32 ] 96/0/0/96 3.27-32.00 257567 MB gpu:a100:2(S:0-1) gpu:a100:2(IDX:0-1)
broadwell_himem idle [ 1 x 40 ] 0/40/0/40 0.07 1492173 MB (null) gpu:0
[...]
zen4 mix [ 13 x 64 ] 346/486/0/832 0.02-50.06 386510 MB (null) gpu:0
zen4 alloc [ 7 x 64 ] 448/0/0/448 2.03-74.64 386510 MB (null) gpu:0
Tip
The command mysinfo -N shows a detailed overview per node.
Helpdesk
If it is not clear why your job is waiting in the queue, do not cancel it. Contact us instead and we will analyse the situation of your job.
2.5. How can I monitor the status of my jobs?#
2.5.1. mysqueue#
The command mysqueue shows a detailed overview of your jobs currently in the queue, either PENDING to start or already RUNNING.
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES CPUS MIN_MEMORY NODELIST(REASON)
1125244 ampere_gpu gpu_job01 vsc10000 RUNNING 3-01:55:38 5-00:00:00 1 16 7810M node404
1125245 ampere_gpu gpu_job02 vsc10000 PENDING 0:00 5-00:00:00 1 16 10300M (Priority)
1125246 skylake my_job01 vsc10000 RUNNING 2-19:58:16 4-23:59:00 2 40 8G node[310,320]
1125247 pascal_gpu gpu_job03 vsc10000 PENDING 0:00 3-00:00:00 1 12 230G (Resources)
Each row in the table corresponds to one of your running or pending jobs or any individual running job in your Job arrays. You can check the PARTITION where each job is running or trying to start and the resources (TIME, NODES, CPUS, MIN_MEMORY) that are/will be allocated to it.
Note
The command mysqueue -t all will show all your jobs in the last 24 hours.
The column NODELIST(REASON) will either show the list of nodes used by a running job or the reason behind the pending state of a job. The most common reason codes are the following:
- Priority
Job is waiting for other pending jobs in front to be processed.
- Resources
Job is in front of the queue but there are no available nodes with the requested resources.
- ReqNodeNotAvail
The requested partition/nodes are not available. This usually happens during a scheduled maintenance.
See also
Full list of reason tags for pending jobs.
2.5.2. Attaching an interactive shell to a job#
Users can also inspect running jobs by attaching an interactive shell to them. This interactive shell runs on the same compute node and in the same environment as the target job, allowing you to monitor what is actually happening inside the job. For instance, you can check the running processes and memory consumption in real time with the command ps aux.
srun --jobid=<SLURM_JOBID> --pty bash
For MPI jobs and other tasks launched with srun, add option --overlap:
srun --jobid=<SLURM_JOBID> --overlap --pty bash
For multi-node jobs, to get into a specific node, add option -w <node-name>. The list of nodes corresponding to a job can be obtained with mysqueue or slurm_jobinfo. For example, to launch an interactive shell on node345, do:
srun --jobid=<SLURM_JOBID> --overlap -w node345 --pty bash
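Once the interactive shell is attached, you can use standard Linux tools to inspect the job; for example:
# list all processes and their memory usage on the node
ps aux
# live view of CPU and memory usage of your own processes (press 'q' to quit)
top -u $USER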
2.6. Why has my job crashed?#
There are many possible causes for a job crash. Here are just a few of the more common ones:
- Requested resources (memory or time limit) exceeded
The command mysacct shows an overview of the resources used by your recent jobs. You can check whether the job FAILED or COMPLETED in the State column, whether the Elapsed time reached the Timelimit, or how much memory it used (MaxRSS). See also How can I check my resource usage?.
- Wrong module(s) loaded
Check the syntax (case sensitive) and version of the modules. If you are using more than one module, check that they use a compatible toolchain. See The toolchain of software packages.
- Wrong input filetype / input read errors
Convert from DOS to UNIX text format with the command:
dos2unix <input_file>
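If you are unsure whether an input file is in DOS format, the standard file utility will mention CRLF line terminators for DOS-formatted text files:
file <input_file>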
2.7. How can I run a job longer than the maximum allowed wall time?#
The maximum wall time for calculations on Hydra is 5 days. If your job requires a longer wall time, there are a few options to consider.
If the software supports multiprocessing and/or multithreading, you can request more CPU cores for your job. Consult the documentation of the software for parallel execution instructions. However, to prevent wasting computational resources, you have to make sure that adding more cores gives you the expected speedup. It is very important that you perform a parallel efficiency test. For more info on parallel efficiency, see the HPC training slides.
Note
It is recommended to increase the requested RAM memory proportionally to the number of processes using --mem-per-cpu=X. If you don’t specify the requested memory, this is done automatically for you.
It is common that scientific software provides methods to stop and restart simulations in a controlled manner. In such a case, long calculations can be split into restartable shorter chunks, which can be submitted one after the other, as sketched below. Check the documentation of your software for restarting options.
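Such restartable chunks can be submitted one after the other as a chain of jobs with Slurm dependencies. This is a minimal sketch, assuming a job script job.sh that picks up the results of the previous chunk:
# submit the first chunk and capture its job ID
jobid=$(sbatch --parsable job.sh)
# each following chunk starts only after the previous one completed successfully
for i in 2 3 4; do
    jobid=$(sbatch --parsable --dependency=afterok:${jobid} job.sh)
done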
If the software does not support any restart method, you can use external checkpointing. See our section on Job Checkpoints and Restarts for more details.
2.8. My jobs run slower than expected, what can I do?#
The most common scenario for low-performing jobs is mismanagement of computational resources. Some examples are:
- Generating a lot more threads/processes than available cores
Software that can parallelize over multiple cores runs optimally if the total number of active threads/processes is in line with the number of cores allocated to your job. As a rule of thumb, use 1 process per core. However, by default, software might use the total number of cores in the node instead of the cores available to your job to determine how many threads/processes to generate. In other instances, executing parallel software from within scripts that are already parallelized can also lead to too many threads/processes. In both cases performance will degrade; see the sketch after this list for a way to keep the thread count in line with your allocation.
- Jobs with barely the minimum memory to work
In situations where memory is limited, but just large enough to guarantee proper execution, applications might need to start swapping memory to disk to avoid an out-of-memory error. Accessing the disk is slow and should be avoided as much as possible. In these cases, increasing the requested memory to have a generous margin (typically ~20%) will allow the application to keep more data in memory, maintaining the same efficiency and increasing performance.
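For threaded software that honours OpenMP settings, a common way to keep the thread count in line with your allocation is to derive it from the Slurm environment in the job script (this assumes the job requests its cores with --cpus-per-task):
# match the number of OpenMP threads to the cores allocated per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}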
Please check the list of instructions for specific software and the recommendations to lower the running time of your jobs. Those can usually be applied to jobs with degraded performance.
In very rare cases, your jobs might slow down due to jobs of other users misusing the resources of the cluster. If you suspect that this is the case, please contact VUB-HPC Support.