Frequently asked questions (FAQ)

Hydra

1. How can I connect to Hydra?

First check the page Creating an account, to make sure that you have an active VSC account.

Next, follow the instructions in the VSC docs on Logging in to a cluster.

If you cannot connect after following the instructions, here are a few things you can check:

  • Is your passphrase correct?

    Upon login you will be asked to re-enter the passphrase that you entered during generation of your SSH key pair.

    To reset your passphrase, you have to generate a new SSH key pair and upload the new public key to your account page at https://account.vscentrum.be. Note that it can take up to 1 hour before you can use your new SSH key.

  • Are you using an SSH key?

    Make sure that your private key file is in the right place and has the right permissions. It needs to have read permissions for the owner of the file only.
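
The permission check can be sketched as follows. This is a minimal example that assumes your private key is stored at ~/.ssh/id_rsa; the actual file name and location depend on how you generated the key. The function names are made up for this example. OpenSSH accepts mode 600 (read and write for the owner only) as well as 400 (read for the owner only).

```shell
# Path to the private key; adjust if your key is stored elsewhere (assumption).
KEY="${KEY:-$HOME/.ssh/id_rsa}"

# Restrict a private key file to read/write access for the owner only.
fix_key_permissions() {
    chmod 600 "$1"
}

# Print the octal permissions of a file (GNU stat, as found on Linux).
show_permissions() {
    stat -c '%a' "$1"
}

# Only touch the key if it exists at the assumed location.
if [ -f "$KEY" ]; then
    fix_key_permissions "$KEY"
    show_permissions "$KEY"
fi
```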

If you still cannot connect, please contact us at hpc@vub.ac.be.

2. What can I do in the login node?

The login node is your main interface with the compute nodes. This is where you can access your data, write the scripts and input files for your calculations and submit them to the job scheduler of Hydra. It is also possible to run small scripts on the login node, for instance to process the resulting data from your calculations or to test your scripts before submission to the scheduler. However, the login node is not the place to run your calculations, and hence the following restrictions apply:

  • Any single user can use a maximum of 12GB of memory in the login node.

  • The amount of CPU time that can be used is always fairly divided over all users. A single user cannot occupy all CPU cores.

  • The allowed network connections to the outside are limited to SSH, FTP, HTTP and HTTPS.

A graphical desktop environment is available through VNC, allowing the use of complex graphical programs and visualization tools. More information is available in the Software section: 4. How can I run graphical applications?

Jobs submitted to the scheduler are preprocessed before placement in the queues to ensure that their resource requirements are correct. For instance, the system automatically assigns memory limits to your job if you did not specify them manually. Detailed information can be found in the section Job Submission: Hydra.

Users compiling their own software should check 3. How can I build/install additional software/packages?.

4. How can I check my disk quota?

To prevent your account from becoming unusable, you should regularly check your disk quota and clean up your $HOME ($VSC_HOME), $VSC_DATA (VSC accounts only) and $VSC_SCRATCH.

You can check your disk quota with the following command:

myquota

Users with a VSC account will regularly receive warning emails when they have reached 90% of their quota.

For more info, please consult the VSC docs on disk space usage.

5. I have exceeded my $HOME disk quota, what now?

When your $HOME ($VSC_HOME) is full your account will become unusable.

In that case you have to delete some files from your $HOME, such as temporary files or checkpoint files, until there is enough free space.

If you need more storage with backup and have a VSC account, you can use $VSC_DATA.

If you need large temporary storage for your running jobs, you can use $VSC_SCRATCH. Files will not be deleted there, but there is also no backup of your data on $VSC_SCRATCH.

Remember that the HPC team is not responsible for data loss (although we will do everything we can to avoid it). You should regularly back up important data and clean up your $HOME, $VSC_DATA and $VSC_SCRATCH.

If you need more storage than is available by default, please contact us at hpc@vub.ac.be.

6. How can I check my resource usage?

Making efficient use of the HPC cluster is important for you and your colleagues:

  • Your jobs will start faster.

  • Better usage means the HPC team can buy more/faster hardware.

The 3 main resources to keep an eye on are:

  • memory usage

  • wall time usage

  • cores usage (CPU time): how many cores are doing actual work?

You can check resource usage of running and recently finished jobs with the command:

myresources

Remarks:

  • Cores usage is only reported for jobs that are running longer than 5 minutes.

  • If your requested memory is 1GiB or less, this is always considered good (you get 2GiB ‘for free’).

You can also check resource usage of finished jobs at the end of your job output file. The last few lines of this file show the requested and used resources, for example:

Resources Requested: walltime=00:05:00,nodes=1:ppn=1,mem=1gb,neednodes=1:ppn=1
Resources Used: cput=00:01:19,vmem=105272kb,walltime=00:01:22,mem=17988kb,energy_used=0
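
From these two lines you can estimate how well the requested cores were used: the CPU efficiency is the CPU time divided by the wall time multiplied by the number of cores. The sketch below does this arithmetic for the example output above; the function names are made up for this example.

```shell
# Convert a time in HH:MM:SS format to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# CPU efficiency in percent: cput / (walltime * cores).
# Arguments: cput (HH:MM:SS), walltime (HH:MM:SS), number of cores.
cpu_efficiency() {
    awk -v c="$(to_seconds "$1")" -v w="$(to_seconds "$2")" -v n="$3" \
        'BEGIN { printf "%.1f\n", 100 * c / (w * n) }'
}

# Values taken from the example output above: cput, walltime, ppn.
cpu_efficiency 00:01:19 00:01:22 1   # prints 96.3
```

A value far below 100% for a multi-core job suggests that some of the requested cores were idle.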

7. I have accidentally deleted some data, how can I restore it?

First of all, remember that you are responsible for your own backups: the HPC team does not guarantee the safety of your data on Hydra. Nevertheless, we will do everything we can to avoid any data loss.

If your deleted data was on your $VSC_SCRATCH, the data is permanently lost, as we do not make any backups there.

We do make backups of $HOME and $VSC_DATA, and we keep them for 1 month.

If you need a file or directory that you accidentally deleted or modified, please contact us at hpc@vub.ac.be.

8. How can I share files with other users?

On Hydra, you can grant a group or an individual user access to some of your files or directories with an access control list (ACL).

Example 1: to give group ‘flash’ access to ‘my_dir’:

setfacl -m g:flash:rx my_dir

Example 2: to give user ‘andrius’ read and write permissions on ‘my_file’ (make sure the user also has access to the parent directory):

setfacl -m u:andrius:rw my_file

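Example 3: to remove write permissions from user ‘andrius’ on ‘my_file’ while keeping read access (setfacl -x removes a whole ACL entry and does not accept a permissions field, so the entry is modified instead):

setfacl -m u:andrius:r my_file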

Example 4: to remove all ACLs from ‘my_file’:

setfacl -b my_file

For more information on ACLs, see: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/storage_administration_guide/acls-setting.

9. Why is my job not starting?

First of all, do not panic if your job does not start immediately. Depending on the load on Hydra, the number of jobs that you have submitted recently, and the resources requested in your job script, waiting times can range from hours to days. Usually, the load on Hydra is higher on weekdays, and lower on weekends and during holidays. Remember that the more resources you request, the longer you may have to wait in the queue.

To get an overview of the available hardware and its current load, you can issue the command:

nodestat

10. Why has my job crashed?

There are many possible causes for a job crash. Here are just a few of the more common ones:

  • Requested resources (memory or walltime) exceeded

    Check the last few lines of your job output file. If the used and requested memory or wall time are very similar, you have probably exceeded your requested resource. Increase the resource in your job script and resubmit.

  • Wrong module(s) loaded

    Check the syntax (case sensitive) and version of the modules. If you are using more than one module, check that they are compiled with the same toolchain.

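  • Wrong input file type

    Text files created on Windows use DOS line endings, which many programs on Linux cannot handle. Convert such files to Unix line endings with the command: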

    dos2unix <input_file>
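
If you are not sure whether a file needs conversion, you can look for carriage return characters, which are the usual sign of DOS line endings. The sketch below uses only grep and printf; the function name has_dos_line_endings is made up for this example.

```shell
# Return success (0) if the file contains at least one carriage return,
# which is the usual sign of DOS/Windows line endings.
has_dos_line_endings() {
    grep -q "$(printf '\r')" "$1"
}
```

For example, has_dos_line_endings <input_file> && dos2unix <input_file> only converts the file when carriage returns are found.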
    

11. How can I use the GPUs to run my jobs?

The available GPUs are listed in the VSC docs on Hydra hardware.

If you want to run a GPU job, submit it with the following PBS directive:

#PBS -l nodes=1:ppn=1:gpus=1

If you want to run your job on a specific GPU type, add one of the following features:

#PBS -l feature=kepler  (for the K20Xm)
#PBS -l feature=pascal  (for the P100)
#PBS -l feature=geforce (for the 1080Ti)

Note however that if you make more specific job requests you may have to wait longer in the queue.

Important: before you submit a GPU job, make sure that you use software that is optimized for running on GPUs with support for CUDA. You can check this by looking at the name of the module. If the module name contains one of the words CUDA, fosscuda, goolfc or gompic, the software is built with CUDA support.
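
The name check described above can be sketched as a small shell function. The module names in the loop are hypothetical and only illustrate the naming pattern; the function name has_cuda_support is made up for this example.

```shell
# Return success (0) if a module name contains one of the keywords that
# indicate a build with CUDA support.
has_cuda_support() {
    case "$1" in
        *CUDA*|*fosscuda*|*goolfc*|*gompic*) return 0 ;;
        *) return 1 ;;
    esac
}

# The module names below are hypothetical examples.
for m in "SomeTool/1.0-fosscuda-2019a" "SomeTool/1.0-foss-2019a"; do
    if has_cuda_support "$m"; then
        echo "$m: built with CUDA support"
    else
        echo "$m: no CUDA support indicated by the name"
    fi
done
```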

12. Jobs on the GPU nodes fail with “all CUDA-capable devices are busy or unavailable”

By default the GPU cards operate in process exclusive mode, meaning that only one process at a time can use the GPU. Hence, the GPU will appear as busy or unavailable to any process trying to use it after the first one. Calculations with multiple processes running on several cores can share a single GPU using one of the following methods:

  1. Put the GPU in shared mode. An unlimited number of processes can then use the GPU:

    #!/bin/bash
    #PBS -l nodes=1:ppn=4:gpus=1:shared
    
    your_script_instructions
    
  2. (Recommended) Launch the CUDA MPS daemon at the beginning of your script and then continue with your normal job script. The daemon coordinates access to the GPU, so everything else should happen automatically:

    #!/bin/bash
    #PBS -l nodes=1:ppn=2:gpus=1
    
    export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps
    nvidia-cuda-mps-control -d
    
    your_script_instructions
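
The MPS method can be extended so that the daemon is shut down again when the job script exits. The following is a sketch under assumptions, not the configuration used by the HPC team: the log directory name is chosen for illustration, and the command -v guard only makes the snippet harmless on machines without the NVIDIA tools. Sending quit to nvidia-cuda-mps-control is the way to stop the daemon described in the NVIDIA MPS documentation.

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=2:gpus=1

# Directories used by the MPS daemon; the log directory name is an assumption.
export CUDA_MPS_PIPE_DIRECTORY="${TMPDIR:-/tmp}/nvidia-mps"
export CUDA_MPS_LOG_DIRECTORY="${TMPDIR:-/tmp}/nvidia-mps-log"
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    # Start the daemon and ask it to quit when the job script exits.
    nvidia-cuda-mps-control -d
    trap 'echo quit | nvidia-cuda-mps-control' EXIT
fi

# your_script_instructions
```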
    

13. How can I run a job longer than the maximum allowed wall time?

The maximum wall time for calculations on Hydra is 5 days. If your job requires a longer wall time, there are a few options to consider.

  1. If the software supports multiprocessing and/or multithreading, you can request more CPU cores for your job. Consult the documentation of the software for parallel execution instructions. However, to prevent wasting computational resources, you have to make sure that adding more cores gives you the expected speedup. It is very important that you perform a parallel efficiency test. For more info on parallel efficiency, see the HPC training slides. Note also that you may have to increase the requested memory in proportion to the number of processes.

  2. Long running calculations can often be split into restartable chunks that can be submitted one after the other. First consult the documentation of the software itself for restarting options.

  3. If the software does not support restarting, you can use checkpointing. Checkpointing means making a snapshot of the current state of your calculation and saving it to disk. The snapshot can then be used to restore the state and continue your calculation in a new job. Checkpointing on Hydra can be done conveniently with csub, a tool that automates the process of:

  • halting the job

  • checkpointing the job

  • killing the job

  • re-submitting the job to the queue

  • reloading the checkpointed state into memory

For example, to submit the job script ‘myjob.pbs’ with checkpointing and re-submitting every 24 hours, type:

csub -s myjob.pbs --shared --job_time=24:00:00

This checkpointing and re-submitting cycle will be repeated until your calculation has completed.

Notes:

  • Checkpointing and reloading are done as part of the job, and typically take up to 15 minutes depending on the amount of memory being used. Thus, in the example above you should specify the following PBS directive in your job script:

    #PBS -l walltime=24:15:00
    
  • Job output and error files are written in your directory $VSC_SCRATCH/chkpt (along with checkpoint files and csub log files). Other output files created by your job may also be written there.

  • Internally, csub uses DMTCP (Distributed MultiThreaded CheckPointing). Users who want full control can also use DMTCP directly. Example launch/restart job scripts can be downloaded here:

  • csub/DMTCP is not yet tested with all installed software on Hydra. It has been successfully used with software written in Python, R, and with Gaussian. For more info on DMTCP-supported applications, see http://dmtcp.sourceforge.net/supportedApps.html

  • If you are running a Gaussian 16 job with csub, a few extra lines must be added to your job script:

    module load Gaussian/G16.A.03-intel-2017b
    unset LD_PRELOAD
    module unload jemalloc/4.5.0-intel-2017b
    export G09NOTUNING=1
    export GAUSS_SCRDIR=$VSC_SCRATCH/<my_unique_gauss_scrdir>  # make sure this directory is present
    

If you run into issues with checkpointing/restarting, please contact us at hpc@vub.ac.be.
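
For the parallel efficiency test mentioned in option 1, the usual definitions are speedup S = T1 / Tn and efficiency E = S / n, where T1 and Tn are the wall times of the same calculation on 1 core and on n cores. The sketch below does this arithmetic with made-up timings; the function name is hypothetical.

```shell
# Print speedup and parallel efficiency given the wall time in seconds on
# 1 core (t1), the wall time on n cores (tn) and the number of cores (n).
parallel_efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" \
        'BEGIN { s = t1 / tn; printf "speedup=%.2f efficiency=%.0f%%\n", s, 100 * s / n }'
}

# Hypothetical timings: 1000 s on 1 core, 300 s on 4 cores.
parallel_efficiency 1000 300 4   # prints speedup=3.33 efficiency=83%
```

If the efficiency drops sharply when you add cores, requesting more of them mostly increases your waiting time in the queue without shortening the calculation much.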

14. Where can I find public datasets and databases?

To avoid each user having to download their own private copy of publicly accessible data, a shared directory /databases is provided.

Users who need public data to run their calculations should always check first if it is already available in /databases. If the data is not available, the HPC team can add it upon request at hpc@vub.ac.be.

15. Can I run containers on Hydra?

Singularity containers are supported on Hydra for quickly testing software before creating an optimized build, and for some very complex installations.

Singularity allows running a container without root privileges. Users with root access on a Linux machine can create their own Singularity image, either manually or by importing an existing Docker image. The HPC team can also create images upon request at hpc@vub.ac.be.

Note that our module system is still the preferred method of running software, because Singularity containers are usually not optimized for the different CPU architectures and put more network pressure on the cluster.

For more information on Singularity containers, see: https://www.sylabs.io/docs

16. Can I copy files between Hydra and ULB/VUB’s ownCloud directly?

You can indeed copy files between Hydra and the ownCloud of ULB/VUB directly. This avoids copying the files to/from your local computer as an intermediate step. The davix tool communicates with the ownCloud server via the WebDAV protocol.

For security reasons, you should never use your netID password. Instead, generate a dedicated App password for davix in the ownCloud web interface:

  1. Go to your personal Settings page.

  2. In the sidebar, click on Security.

  3. Under the heading App passwords / tokens, type davix in the App name field.

  4. Click on Create new app passcode.

  5. Copy the generated password and save it in a secure place. This password should always be used with davix.

Example usage:

  • Copy file from_oc.txt from your ownCloud home directory:

    davix-get https://owncloud.vub.ac.be/remote.php/webdav/from_oc.txt from_oc.txt --userlogin <netID> --userpass <passwd>
    
  • Copy file to_oc.txt to your ownCloud home directory:

    davix-put to_oc.txt https://owncloud.vub.ac.be/remote.php/webdav/to_oc.txt --userlogin <netID> --userpass <passwd>
    
  • Copy directory mydir recursively to your ownCloud home directory (using 4 concurrent threads for increased speed):

    davix-put mydir https://owncloud.vub.ac.be/remote.php/webdav/mydir --userlogin <netID> --userpass <passwd> -r 4
    

For more information on davix command line tools, see: https://davix.web.cern.ch/davix/docs/master/cli-examples.html

17. How can I transfer data to/from Hydra with Globus?

Hydra is already available in Globus with its own collection. The name of Hydra’s collection is VSC VUB Tier2 (or ULB Hydra for NetID accounts). Please follow the steps below to add Hydra to your Globus account:

  1. Install and configure Globus Connect Personal on your local computer following the VSC guide on Globus

  2. Open Globus and select the File Manager in the left panel

  3. Write VSC VUB Tier2 or ULB Hydra in the Collections field and select it

  4. At this point, the storage of Hydra will open and you can navigate it within Globus. Only data in your $VSC_DATA and $VSC_SCRATCH will be accessible

    • Path to your VSC_SCRATCH: /~/scratch/brussel/<vsc_first_3_digits>/<vsc_username>/

    • Path to your VSC_DATA: /~/data/brussel/<vsc_first_3_digits>/<vsc_username>/

    For NetID accounts, both your home and workdir are accessible. By default you will go to your home directory. For the work directory you need to change the path to /work/<NetID username>
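
The paths for VSC accounts listed above can be built from the username alone, as the 3 digits are the first three digits after the vsc prefix. A small sketch, where vsc10001 is a made-up username and globus_path is a hypothetical helper name:

```shell
# Build the Globus path for a VSC account from its username.
# Arguments: storage kind ("scratch" or "data") and VSC username.
globus_path() {
    kind=$1
    user=$2
    # Characters 4 to 6 of the username are its first three digits.
    digits=$(echo "$user" | cut -c4-6)
    echo "/~/$kind/brussel/$digits/$user/"
}

# vsc10001 is a made-up username used for illustration.
globus_path scratch vsc10001   # prints /~/scratch/brussel/100/vsc10001/
```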

Note

We recommend creating bookmarks to easily access data in your account

18. How can I automate the transfer of data to/from Hydra?

Automatic (scripted) data transfer between Hydra and external SSH servers can be done safely using rsync on Hydra with a secure SSH connection without password. The authentication with the external server is done with a specific pair of keys that does not require any additional password or passphrase from the user. Once the passwordless SSH connection between Hydra and the external server is configured, rsync can use it to transfer data between them.

The only caveat of this method is that anybody gaining access to your Hydra account will automatically gain access to your account on the external server as well. Therefore, it is very important that you use a user account on the external server that is exclusively used for sending/receiving files to/from Hydra and that has limited user rights.

The following steps show the easiest way to set up a secure connection without password to an external server:

  1. Check the connection to the external server from Hydra: Log in to Hydra and try to connect to the external server with a regular SSH connection using a password. If this step does not work, your server may not be reachable from Hydra and you should contact the administrators of the external server to make it accessible.

    $ ssh <username>@<hostname.external.server>
    
  2. Create an SSH key pair without passphrase: Log in to Hydra and create a new pair of SSH keys that will be exclusively used for data transfers with external servers. The new keys have to be stored inside the .ssh folder in your home directory. In the example below, the new key is called id_filetransfer. Leave the passphrase field empty to avoid any password prompt on authentication.

    $ ssh-keygen
    Generating public/private rsa key pair.
    Enter file in which to save the key (/your/home/.ssh/id_rsa): </your/home/.ssh/id_filetransfer>
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in id_filetransfer.
    Your public key has been saved in id_filetransfer.pub.
    [...]
    
  3. Transfer the public key to the external server: The public part of the new key created on Hydra without a passphrase has to be installed on the external server as well. In this step you will have to provide your password to connect to the external server.

    $ ssh-copy-id -i ~/.ssh/id_filetransfer <username>@<hostname.external.server>
    
  4. Configure the connection to the external server: The specific keys used in the connection with the external server can be defined in the file ~/.ssh/config. This avoids having to explicitly set the option -i ~/.ssh/id_filetransfer on every SSH connection. Add the following lines at the bottom of your ~/.ssh/config file in Hydra (create the file if it does not exist):

    Host <hostname.external.server>
        User <username>
        IdentityFile ~/.ssh/id_filetransfer
    
  5. Check the passwordless connection: At this point it should be possible to connect from Hydra to the external server with the new keys and without any prompt for a password.

    $ ssh <username>@<hostname.external.server>
    
  6. Automatic copy of files: Once the passwordless SSH connection is properly configured, rsync will automatically use it. You can execute the following commands on Hydra to transfer data either to or from the external server.

    # Transfer from Hydra to external server
    $ rsync -av /path/to/source <username>@<hostname.external.server>:/path/to/destination
    # Transfer from external server to Hydra
    $ rsync -av <username>@<hostname.external.server>:/path/to/source /path/to/destination
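
When the transfer runs unattended from a script, it is usually better for it to fail than to wait for a password prompt if the key setup is broken. A minimal sketch, assuming the configuration from the steps above is in place; the function name transfer is made up for this example, and BatchMode is a standard OpenSSH client option that disables password prompts.

```shell
# Copy files with rsync, failing instead of prompting for a password
# if the passwordless SSH connection is not working (BatchMode=yes).
# Arguments: source and destination, in the same form as for rsync.
transfer() {
    rsync -a -e "ssh -o BatchMode=yes" "$1" "$2"
}
```

For example, transfer /path/to/source <username>@<hostname.external.server>:/path/to/destination copies the data to the external server and returns a non-zero exit code if the connection cannot be established without a password.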
    

Vega

(TODO)

VSC Tier-1

(TODO)

CECI Tier-1

See the CECI website: http://www.ceci-hpc.be/faq.html