HPC Data Management#

Handling and organizing your data in the HPC is an important part of your research projects. The following sections cover different methods to move data in and out of the HPC and between the different storage systems in the cluster.

Data transfer#

We provide multiple tools to transfer data in and out of our HPC clusters. Each tool has its strengths and weaknesses, so it’s important to pick the one that best fits your needs. Usually, the main factor differentiating these tools is the type of system holding the data outside the HPC cluster (example commands for some of these tools are shown after this list):

  • Data transfer from/to public and private servers

    • wget is the simplest tool to download files from links (URLs) on the Internet on any Linux system. It is available on all nodes in the HPC.

    • aria2 is more powerful than wget. It can download files using additional protocols (HTTP/HTTPS, FTP, SFTP, BitTorrent, Metalink) as well as download files in parallel.

    • git is a common tool to download source code from public and private repositories.

    • Globus is a powerful platform connecting research centres and personal computers around the globe that allows you to transfer data between them securely and quickly.

    • davix can be used to transfer data to/from WebDAV servers such as Nextcloud or ownCloud.

  • Data transfer from/to personal computers

    • rsync is an efficient and secure tool to copy folders and files over the Internet and to easily keep them synchronised.

    • OneDrive of VUB is connected to our Tier-2 HPC (Hydra). You can easily sync files from your computer to OneDrive and keep those same files synced on Hydra.

    • Globus is a powerful platform connecting research centres and personal computers around the globe that allows you to transfer data between them securely and quickly.
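
For quick reference, the sketch below shows typical download commands with wget, aria2 and git. The URLs and repository address are placeholders for illustration only:

Example downloads with wget, aria2 and git (placeholder URLs)#
# download a single file with wget
wget https://example.org/dataset.tar.gz
# download the same file using up to 4 parallel connections with aria2
aria2c -x 4 https://example.org/dataset.tar.gz
# clone a source code repository with git
git clone https://github.com/example/project.git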

Your OneDrive in VUB#

You can directly copy files between Hydra and your OneDrive in VUB using the third-party sync app OneDrive Client for Linux. This method avoids the additional step of copying the files to/from OneDrive on your local computer before transferring them to the HPC.

Warning

Several restrictions and limitations apply to OneDrive (a quick check for problematic file names is sketched after this list):

  • OneDrive does not distinguish capitalization in file names. Avoid having two files in the same folder whose names only differ in capitalization.

  • OneDrive does not allow file names that contain any of the characters \ / : * ? " < > |. Files that contain any of these characters will not be synced.

  • The following names aren’t allowed for files or folders: .lock, CON, PRN, AUX, NUL, COM0 - COM9, LPT0 - LPT9, _vti_, desktop.ini, any filename starting with ~$.

  • “_vti_” cannot appear anywhere in a file name.

  • “forms” isn’t supported when the folder is at the root level for a library.

  • You can’t create a folder name in SharePoint that begins with a tilde (~).
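
To spot file names that would be rejected before syncing, you can search your sync directory for the forbidden characters. The command below is a minimal sketch: it assumes the sync directory created later on this page and only checks for forbidden characters (the / character cannot appear in Linux file names anyway), not for the reserved names.

Check for file names with characters not allowed by OneDrive#
find "$VSC_DATA/onedrive/hydra-sync" | grep '[\\:*?"<>|]'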

Client authorization#

  1. Start by executing the command onedrive in a terminal shell on Hydra. The first time it will print the following information on screen:

    onedrive
    

    Upon execution, a URL starting with https://login.microsoftonline.com is shown to authorize the client to access your VUB Office 365 account. The URL contains the client_id of the sync app, which should be exactly d50ca740-c83f-4d1b-b616-12c519384f0c:

    Output:#
    $ onedrive
    Configuring Global Azure AD Endpoints
    Authorize this app visiting:
    
    https://login.microsoftonline.com/[...]
    
    Enter the response uri:
    
  2. Copy/paste the full URL into your browser

  3. Log in with your credentials if necessary. You should be redirected to a blank page in your browser

  4. Copy/paste the URL of the blank page into the onedrive prompt in Hydra

At this point, if there is no error, your client should have access to your account. By default, the access token for Office 365 is stored in the file ~/.config/onedrive/refresh_token.
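
You can quickly verify that the authorization succeeded by checking that the token file has been created:

Check the access token file#
ls -l ~/.config/onedrive/refresh_token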

Synchronize with personal OneDrive#

  1. Create a directory that will be synced with your OneDrive.

    The following command creates the sync directory hydra-sync inside $VSC_DATA/onedrive (avoid using $HOME as it is small).

    mkdir -p $VSC_DATA/onedrive/hydra-sync
    
  2. Create the configuration file ~/.config/onedrive/config.

    The following commands generate the config file. The entry sync_dir is mandatory and points to the parent directory of the sync directory. Also, we recommend skipping symlinks and dotfiles (files that start with a dot) by default to avoid unexpected data transfers, unless you know that you need them.

    config=~/.config/onedrive/config
    echo sync_dir = \"$VSC_DATA/onedrive\" > $config
    echo 'skip_symlinks = "true"' >> $config
    echo 'skip_dotfiles = "true"' >> $config
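
    With the example paths of user vsc10001, the resulting config file should contain:

    sync_dir = "/data/brussel/100/vsc10001/onedrive"
    skip_symlinks = "true"
    skip_dotfiles = "true"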
    
  3. Create the sync_list file ~/.config/onedrive/sync_list.

    The following command adds the sync directory hydra-sync to the sync_list file. This ensures that only data inside the sync directory is synced.

    echo hydra-sync > ~/.config/onedrive/sync_list
    
  4. Check if the OneDrive client has been configured correctly.

    onedrive --resync --synchronize --verbose --dry-run
    
  5. If the dry-run succeeded, re-run the above command but remove the --dry-run option to do the real sync.

    onedrive --resync --synchronize --verbose
    

    If the sync is successful, the sync directory (here: hydra-sync) should show up under My files in your VUB OneDrive.

  6. For subsequent synchronizations, also remove the --resync option to avoid any further full synchronization. A resync is only needed after modifying the configuration or sync_list file.

    onedrive --synchronize --verbose
    

Globus#

The Globus platform allows you to efficiently move data between the different VSC clusters (and other hosts that support Globus) and to/from your local computer. Globus supports data transfer via a web interface, a CLI, and a Python SDK. Globus can also be used for sharing data with other researchers, both VSC users and external colleagues.

Globus web interface#

Hydra is already available in Globus with its own collection. The name of Hydra’s collection is VSC VUB Hydra. Please follow the steps below to add Hydra to your Globus account:

  1. Log in to the Globus web app with your account from any institute in the VSC network as described in Globus Access

  2. Open the File Manager on the left panel

  3. Write VSC VUB Hydra in the Collections field and select it

  4. At this point, when you use the VSC VUB Hydra collection for the first time, you will be asked to grant additional authentication/consent for your account

    1. Click on Continue

    2. Select the first identity from the list. For VUB users, the first identity in the list should be a link of the form username@vub.be

    3. Click Allow

    Note

    Linking an identity from another institute is not necessary to access the collection in Hydra. We allow access from all VSC institutes.

  5. The storage of Hydra will open and you can navigate through it within Globus. By default you will see the contents of your home directory, but you can also navigate to your other partitions in Hydra.

    Copy the path printed by the following echo commands in your shell in Hydra, paste it into the file manager in Globus and press Enter to jump to the corresponding storage partition:

    • VSC_DATA: echo $VSC_DATA

      Example for vsc10001#
      /data/brussel/100/vsc10001/
      
    • VSC_SCRATCH: echo $VSC_SCRATCH

      Example for vsc10001#
      /scratch/brussel/100/vsc10001/
      
    • VSC_DATA_VO_USER: echo $VSC_DATA_VO_USER

      Example for vsc10001#
      /data/brussel/vo/000/bvo00005/vsc10001/
      
    • VSC_SCRATCH_VO_USER: echo $VSC_SCRATCH_VO_USER

      Example for vsc10001#
      /scratch/brussel/vo/000/bvo00005/vsc10001/
      

    Tip

    Create bookmarks in Globus to easily access your data in Hydra

Data transfer to/from your local computer#

Install and configure Globus Connect Personal on your local computer following the Globus documentation.

Globus CLI and Python SDK#

To automate your data transfers, in addition to the web interface, you can also use either the Globus CLI or the Python SDK, both of which are available in Hydra upon loading the following module:

module load Globus-CLI
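
As an illustration, a basic transfer with the Globus CLI could look like the sketch below. The endpoint UUIDs and paths are placeholders that you have to replace with those of your own collections:

Example transfer with the Globus CLI (placeholder UUIDs and paths)#
module load Globus-CLI
globus login
# look up the UUID of a collection by its name
globus endpoint search "VSC VUB Hydra"
# start an asynchronous transfer between two collections
globus transfer <source_uuid>:/path/to/source <destination_uuid>:/path/to/destination --recursive --label "my-transfer"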

Connect VUB’s OneDrive with Globus#

VUB OneDrive is available in Globus with its own collection. The name of the collection is VUB OneDrive. Please follow the steps below to add VUB OneDrive to your Globus account:

  1. Log in to the Globus web app with your account from any institute in the VSC network as described in Globus Access

  2. Open the File Manager on the left panel

  3. Write VUB OneDrive in the Collections field and select it

  4. At this point, when you use the VUB OneDrive collection for the first time, you will be asked to grant additional authentication/consent for your account

    1. Click Continue

    2. Select the first identity from the list. For VUB users, the first identity in the list should be a link of the form username@vub.be

    3. Click Allow

  5. You are now redirected to the File Manager page, requesting authentication for a second time

    1. Click Continue

    2. In the Credentials tab, click Add credential

    3. Click Continue

  6. The storage of VUB OneDrive will open and you can navigate through it within Globus. You can now transfer data between VUB OneDrive and the VSC VUB Hydra collection or your local computer.

Tip

You can also access data in SharePoint through the VUB OneDrive collection in Globus. Navigate to the parent folder of your OneDrive collection and you will find your SharePoint sites as well.

WebDAV services such as Nextcloud/ownCloud#

You can copy files directly between Hydra and cloud services that support the WebDAV protocol, such as Nextcloud/ownCloud. This avoids copying the files to/from your local computer as an intermediate step. The davix tools are installed by default on the login nodes.

Examples using davix to copy files between Hydra and your cloud service:

Copy the file myfile.txt from your cloud home directory to Hydra (replace <cloud_url> with the WebDAV URL of the cloud service)#
davix-get <cloud_url>/myfile.txt myfile.txt --userlogin <login> --userpass <passwd>
Copy the file myfile.txt from Hydra to your cloud home directory#
davix-put myfile.txt <cloud_url>/myfile.txt --userlogin <login> --userpass <passwd>
Copy directory mydir recursively from Hydra to your cloud home directory (using 4 concurrent threads with -r for increased speed)#
davix-put mydir <cloud_url>/mydir --userlogin <login> --userpass <passwd> -r 4
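
For instance, davix-ls can list the contents of a remote directory before copying any data:

List the contents of a remote directory in your cloud service#
davix-ls <cloud_url>/mydir --userlogin <login> --userpass <passwd>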

See also

The davix documentation.

Data in jobs#

Data management in jobs can be cumbersome for users running many jobs regularly. The solution is to automate as much as possible in your job scripts. The shell scripting language used in job scripts is quite powerful and can automatically create directories, copy data and modify files before executing the code running your simulations.

Automatic working directory in scratch#

As described in HPC Data Storage, all job scripts must read/write data from/to scratch to run with optimal performance. This can be either your personal VSC_SCRATCH or the scratch of your VO, VSC_SCRATCH_VO. However, important datasets and results should not be left in scratch; instead, transfer them to the more reliable storage of VSC_DATA or VSC_DATA_VO, which has backups.

The job script below is a simple example that creates a transient working directory on the fly in VSC_SCRATCH, which will be used to read/write all data during execution. The job will perform the following steps:

  1. Create a new unique working directory in VSC_SCRATCH

  2. Copy all input files from a folder in VSC_DATA into the working directory

  3. Execute the simulation using the files from the working directory

  4. Copy the output file to the same directory from where the job was launched

  5. Delete the working directory

Job script to automatically manage data between VSC_DATA and VSC_SCRATCH#
#!/bin/bash

#SBATCH --job-name="your-job"
#SBATCH --output="%x-%j.out"

module load AwesomeSoftware/1.1.1

# Input data directory in VSC_DATA
DATADIR="${VSC_DATA}/my-project/dataset01"

# Working directory for this job in VSC_SCRATCH
WORKDIR="${VSC_SCRATCH:-/tmp}/${SLURM_JOB_NAME:-$USER}.${SLURM_JOB_ID:-0}"

# Populate working directory
echo "== Populating new working directory: $WORKDIR"
mkdir -p "$WORKDIR"
rsync -av "$DATADIR/" "$WORKDIR/"
cd "$WORKDIR"

# Start simulation
# note: adapt to your case, input/output files might be handled differently
<your-command> data.inp > results.out

# Save output and clean the scratch
# (these steps are optional, you can also perform these manually once the job finishes)
cp -a results.out "$SLURM_SUBMIT_DIR/"
rm -r "$WORKDIR"
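
The job script can then be submitted as any other job; for instance, assuming it is saved as my-job-script.sh:

Submit the job script#
sbatch my-job-script.sh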

Automatic data transfer in jobs#

Automatic (scripted) data transfer between Hydra and external SSH servers can be done safely using rsync in Hydra over a secure, passwordless SSH connection. Authentication with the external server is done with a dedicated pair of keys that does not require any additional password or passphrase from the user. Once the passwordless SSH connection between Hydra and the external server is configured, rsync can use it to transfer data between them.

Important

This method enables automatic login to external servers from Hydra. This means that anybody gaining access to your Hydra account (which should never happen) will automatically gain access to your account on that external server as well. Therefore, it is very important that you use a user account on the external server that is exclusively used for sending/receiving files to/from Hydra and that has limited user rights.

The following steps show the easiest way to set up a secure passwordless connection to an external server:

  1. Check the connection to the external server from Hydra: Log in to Hydra and try to connect to the external server with a regular SSH connection using a password. If this step does not work, your server may not be reachable from Hydra and you should contact the administrators of the external server to make it accessible.

    $ ssh <username>@<hostname.external.server>
    
  2. Create an SSH key pair without passphrase: Log in to Hydra and create a new pair of SSH keys to be used exclusively for data transfers with your external server. The new keys have to be stored inside the .ssh folder in your home directory. In the example below, the new key is called id_filetransfer. Leave the passphrase field empty to avoid any password prompt on authentication.

    $ ssh-keygen
    Generating public/private rsa key pair.
    Enter file in which to save the key (/your/home/.ssh/id_rsa): </your/home/.ssh/id_filetransfer>
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in id_filetransfer.
    Your public key has been saved in id_filetransfer.pub.
    [...]
    
  3. Transfer the keys to the external server: The new key created in Hydra without a passphrase has to be installed on the external server as well. You can do that with the command ssh-copy-id, which will first ask for your password to connect to the external server and then copy your new passwordless key id_filetransfer to it.

    $ ssh-copy-id -i ~/.ssh/id_filetransfer <username>@<hostname.external.server>
    
  4. Configure the connection to the external server: The specific keys used in the connection with the external server can be defined in the file ~/.ssh/config. This avoids having to explicitly set the option -i ~/.ssh/id_filetransfer on every SSH connection. Add the following lines at the bottom of your ~/.ssh/config file in Hydra (or create the file if it does not exist):

    Host <hostname.external.server>
        User <username>
        IdentityFile ~/.ssh/id_filetransfer
    

    Verify that the permissions of ~/.ssh/config are properly set after editing it, otherwise you might get permission errors in the next steps:

    $ chmod 600 ~/.ssh/config
    
  5. Check the passwordless connection: At this point it should be possible to connect from Hydra to the external server with the new keys and without any prompt for a password.

    $ ssh <username>@<hostname.external.server>
    
  6. Automatic copy of files: Once the passwordless SSH connection is properly configured, rsync will automatically use it. You can execute the following commands in Hydra to transfer data either to or from the external server:

    Transfer from Hydra to external server#
    $ rsync -av /path/to/source <username>@<hostname.external.server>:/path/to/destination
    
    Transfer from external server to Hydra#
    $ rsync -av <username>@<hostname.external.server>:/path/to/source /path/to/destination
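
Once the passwordless connection works, the same rsync command can also be placed at the end of a job script to copy results automatically when the simulation finishes. The fragment below is a minimal sketch; the file names, user name and host name are placeholders:

Example job script fragment that copies results to an external server#
#!/bin/bash
#SBATCH --job-name="your-job"
#SBATCH --output="%x-%j.out"

# run your simulation (adapt to your case)
<your-command> data.inp > results.out

# copy the results to the external server over the passwordless SSH connection
rsync -av results.out <username>@<hostname.external.server>:/path/to/destination/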