HPC Data Management#
Handling and organizing your data in the HPC is an important part of your research projects. The following sections cover different methods to move data in and out of the HPC and between the different storage systems in the cluster.
Data transfer#
We provide multiple tools to transfer data in and out of our HPC clusters. Each tool has its strengths and weaknesses, so it’s important to pick the one that best fits your needs. The main factor differentiating these tools is usually the type of system holding the data outside the HPC cluster:
Data transfer from/to public and private servers
- wget is the simplest tool to download files from links (URLs) on the Internet on any Linux system. It is available on all nodes in the HPC.
- aria2 is more powerful than wget. It supports additional protocols (HTTP/HTTPS, FTP, SFTP, BitTorrent, Metalink) and can download files in parallel.
- git is a common tool to download source code from public and private repositories.
- Globus is a powerful platform connecting research centres and personal computers around the globe that allows you to transfer data between them quickly and securely.
- davix can be used to transfer data from WebDAV servers such as Nextcloud or ownCloud.
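As an illustration, downloading data with these tools could look like the following sketch. The URLs and file names are placeholders to be replaced with the actual links to your data:

```shell
# Hypothetical examples; adapt the URLs to your case
wget https://<server>/<dataset>.tar.gz           # simple single-file download
aria2c -x 4 https://<server>/<dataset>.tar.gz    # download with 4 parallel connections
git clone https://<server>/<project>.git         # fetch a source code repository
```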
Data transfer from/to personal computers
- rsync is an efficient and secure tool to copy folders and files over the Internet and to easily keep them synchronised.
- OneDrive of VUB is connected to our Tier-2 HPC (Hydra). You can easily sync files from your computer to OneDrive and keep those same files synced in Hydra.
- Globus is a powerful platform connecting research centres and personal computers around the globe that allows you to transfer data between them quickly and securely.
Your OneDrive in VUB#
You can directly copy files between Hydra and your OneDrive in VUB using the third-party sync app OneDrive Client for Linux. This method avoids the additional step of copying the files to/from your local computer before transferring them to the HPC.
Warning
Several restrictions and limitations apply to OneDrive:
- OneDrive does not distinguish capitalization in file names. Avoid having two files in the same folder whose names only differ in capitalization.
- OneDrive does not allow file names that contain any of the characters " * : < > ? / \ |. Files containing any of these characters will not be synced.
- The following names aren’t allowed for files or folders: .lock, CON, PRN, AUX, NUL, COM0 - COM9, LPT0 - LPT9, _vti_, desktop.ini, any file name starting with ~$.
- “_vti_” cannot appear anywhere in a file name.
- “forms” isn’t supported when the folder is at the root level for a library.
- You can’t create a folder name in SharePoint that begins with a tilde (~).
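Before syncing, you can check for file names that OneDrive will reject. The following sketch uses find with a bracket pattern covering the disallowed characters; the directory and file names are illustrative:

```shell
# Create an illustrative directory with one valid and one invalid file name
mkdir -p check-demo
touch check-demo/report.txt 'check-demo/bad:name.txt'
# List files whose names contain characters OneDrive does not allow
# (the / and \ cases cannot occur in Linux file names anyway)
find check-demo -name '*[*:<>?"|]*'    # -> check-demo/bad:name.txt
```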
Synchronize with personal OneDrive#
1. Create a directory that will be synced with your OneDrive. The following command creates the sync directory hydra-sync inside $VSC_DATA/onedrive (avoid using $HOME as it is small):

       mkdir -p $VSC_DATA/onedrive/hydra-sync

2. Create the configuration file ~/.config/onedrive/config. The following commands generate the config file. The entry sync_dir is mandatory and points to the parent directory of the sync directory. We also recommend skipping symlinks and dotfiles (files that start with a dot) by default to avoid unexpected data transfers, unless you know that you need them:

       config=~/.config/onedrive/config
       echo sync_dir = \"$VSC_DATA/onedrive\" > $config
       echo 'skip_symlinks = "true"' >> $config
       echo 'skip_dotfiles = "true"' >> $config

3. Create the sync_list file ~/.config/onedrive/sync_list. The following command adds the sync directory hydra-sync to the sync_list file. This ensures that only data inside the sync directory is synced:

       echo hydra-sync > ~/.config/onedrive/sync_list

4. Check that the OneDrive client has been configured correctly:

       onedrive --resync --synchronize --verbose --dry-run

5. If the dry run succeeded, re-run the above command without the --dry-run option to do the real sync:

       onedrive --resync --synchronize --verbose

6. If the sync is successful, the sync directory (here: hydra-sync) should show up under My files in your VUB OneDrive. For subsequent synchronizations, also remove the --resync option to avoid another full synchronization. A resync is only needed after modifying the configuration or sync_list file:

       onedrive --synchronize --verbose
Globus#
The Globus platform allows you to efficiently move data between the different VSC clusters (and other hosts that support Globus) and to/from your local computer. Globus supports data transfer via a web interface, a CLI, and a Python SDK. Globus can also be used for sharing data with other researchers, both VSC users and external colleagues.
Globus web interface#
Hydra is already available in Globus with its own collection. The name of Hydra’s collection is VSC VUB Hydra. Please follow the steps below to add Hydra to your Globus account:
1. Log in to the Globus web app with your account from any institute in the VSC network as described in Globus Access
2. Open the File Manager on the left panel
3. Write VSC VUB Hydra in the Collections field and select it
4. When you use the VSC VUB Hydra collection for the first time, you will be asked to give further authentication/consent for your account
5. Click Continue
6. Select the first identity from the list. For VUB users, the first identity in the list should be a link of the form username@vub.be
7. Click Allow
Note
Linking an identity from another institute is not necessary to access the collection in Hydra. We allow access from all VSC institutes.
The storage of Hydra will open and you can navigate through it within Globus. By default you will see the contents of your home directory, but you can also navigate to your other partitions in Hydra.
Copy the path printed by the following echo commands in your shell in Hydra, paste it in the File Manager in Globus and press Enter to jump to the corresponding storage partition:

- VSC_DATA:

      $ echo $VSC_DATA
      /data/brussel/100/vsc10001/

- VSC_SCRATCH:

      $ echo $VSC_SCRATCH
      /scratch/brussel/100/vsc10001/

- VSC_DATA_VO_USER:

      $ echo $VSC_DATA_VO_USER
      /data/brussel/vo/000/bvo00005/vsc10001/

- VSC_SCRATCH_VO_USER:

      $ echo $VSC_SCRATCH_VO_USER
      /scratch/brussel/vo/000/bvo00005/vsc10001/
Tip
Create bookmarks in Globus to easily access your data in Hydra
Data transfer to/from your local computer#
Install and configure Globus Connect Personal on your local computer following the Globus documentation.
Globus CLI and Python SDK#
To automate your data transfers, in addition to the web interface, you can also use either the Globus CLI or the Python SDK, both of which are available in Hydra upon loading the following module:
module load Globus-CLI
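A scripted transfer with the Globus CLI could look like the following sketch. The endpoint UUIDs and paths are placeholders: look up the real UUIDs of your collections with globus endpoint search or in the web interface.

```shell
# Hypothetical example; endpoint UUIDs and paths must be adapted to your case
globus login    # one-time browser-based authentication
globus ls '<hydra-collection-uuid>:/data/brussel/100/vsc10001'
globus transfer --recursive --label "hydra-to-laptop" \
    '<hydra-collection-uuid>:/data/brussel/100/vsc10001/my-project' \
    '<personal-endpoint-uuid>:/home/me/my-project'
```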
Connect VUB’s OneDrive with Globus#
VUB OneDrive is available in Globus with its own collection. The name of the collection is VUB OneDrive. Please follow the steps below to add VUB OneDrive to your Globus account:
1. Log in to the Globus web app with your account from any institute in the VSC network as described in Globus Access
2. Open the File Manager on the left panel
3. Write VUB OneDrive in the Collections field and select it
4. When you use the VUB OneDrive collection for the first time, you will be asked to give further authentication/consent for your account
5. Click Continue
6. Select the first identity from the list. For VUB users, the first identity in the list should be a link of the form username@vub.be
7. Click Allow
8. You are now redirected to the File Manager page, requesting authentication for a second time
9. Click Continue
10. In the Credentials tab, click Add credential
11. Click Continue
The storage of VUB OneDrive will open and you can navigate through it within Globus. You can now transfer data between VUB OneDrive and the VSC VUB Hydra collection or your local computer.
Tip
You can also access data in SharePoint through the VUB OneDrive collection in Globus. Navigate to the parent folder of your OneDrive collection and you will find your SharePoint sites as well.
WebDAV services such as Nextcloud/ownCloud#
You can copy files directly between Hydra and cloud services that support the WebDAV protocol, such as Nextcloud/ownCloud. This avoids copying the files to/from your local computer as an intermediate step. The davix tools are installed by default on the login nodes.
Example: using davix to copy files between Hydra and your cloud service:

    davix-get <cloud_url>/myfile.txt myfile.txt --userlogin <login> --userpass <passwd>
    davix-put myfile.txt <cloud_url>/myfile.txt --userlogin <login> --userpass <passwd>
    davix-put mydir <cloud_url>/mydir --userlogin <login> --userpass <passwd> -r 4
See also
The davix documentation.
Data in jobs#
Data management in jobs can be cumbersome for users running many jobs regularly. The solution is to automate as much as possible in your job scripts. The shell scripting language used in job scripts is quite powerful and can automatically create directories, copy data and modify files before executing the code running your simulations.
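For instance, a few lines of shell can create a per-run directory and fill in a template input file. The sketch below is self-contained and uses illustrative names; in a real job script you would use $SLURM_JOB_ID instead of the shell PID:

```shell
# Create a unique directory for this run (in a job: use $SLURM_JOB_ID)
RUN_ID="run.$$"
mkdir -p "$RUN_ID"
# Generate a template input file and patch a placeholder value with sed
printf 'temperature = TEMP_VALUE\n' > template.inp
sed 's/TEMP_VALUE/300/' template.inp > "$RUN_ID/simulation.inp"
cat "$RUN_ID/simulation.inp"    # -> temperature = 300
```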
Automatic working directory in scratch#
As described in HPC Data Storage, all job scripts must read/write data from/to scratch to run with optimal performance. It can be either the personal VSC_SCRATCH or the scratch of your VO, VSC_SCRATCH_VO. However, important datasets and results should not be left in the scratch and should instead be transferred to the more reliable storage of VSC_DATA or VSC_DATA_VO, which has backups.
The job script below is a simple example that creates a transient working directory on the fly in VSC_SCRATCH and uses it to read/write all data during execution. The job will perform the following steps:

1. Create a new unique working directory in VSC_SCRATCH
2. Copy all input files from a folder in VSC_DATA into the working directory
3. Execute the simulation using the files from the working directory
4. Copy the output file to the same directory from where the job was launched
5. Delete the working directory
    #!/bin/bash

    #SBATCH --job-name="your-job"
    #SBATCH --output="%x-%j.out"

    module load AwesomeSoftware/1.1.1

    # Input data directory in VSC_DATA
    DATADIR="${VSC_DATA}/my-project/dataset01"

    # Working directory for this job in VSC_SCRATCH
    WORKDIR="${VSC_SCRATCH:-/tmp}/${SLURM_JOB_NAME:-$USER}.${SLURM_JOB_ID:-0}"

    # Populate working directory
    echo "== Populating new working directory: $WORKDIR"
    mkdir -p "$WORKDIR"
    rsync -av "$DATADIR/" "$WORKDIR/"
    cd "$WORKDIR"

    # Start simulation
    # note: adapt to your case, input/output files might be handled differently
    <your-command> data.inp > results.out

    # Save output and clean the scratch
    # (these steps are optional, you can also perform them manually once the job finishes)
    cp -a results.out "$SLURM_SUBMIT_DIR/"
    rm -r "$WORKDIR"
Automatic data transfer in jobs#
Automatic (scripted) data transfer between Hydra and external SSH servers can be safely done using rsync in Hydra with a secure SSH connection without a password. The authentication with the external server is done with a dedicated pair of keys that does not require any additional password or passphrase from the user. Once the passwordless SSH connection between Hydra and the external server is configured, rsync can use it to transfer data between them.
Important
The only caveat of this method is that anybody gaining access to your Hydra account will automatically gain access to your account in the external server as well. Therefore, it is very important that you use a user account in the external server that is exclusively used for sending/receiving files to/from Hydra and that has limited user rights.
The following steps show the easiest way to set up a secure connection without a password to an external server:
Check the connection to the external server from Hydra: Log in to Hydra and try to connect to the external server with a regular SSH connection using a password. If this step does not work, your server may not be reachable from Hydra and you should contact the administrators of the external server to make it accessible:
$ ssh <username>@<hostname.external.server>
Create an SSH key pair without a passphrase: Log in to Hydra and create a new pair of SSH keys that will be used exclusively for data transfers with external servers. The new keys have to be stored inside the .ssh folder in your home directory. In the example below, the new key is called id_filetransfer. Leave the passphrase field empty to avoid any password prompt on authentication:

    $ ssh-keygen
    Generating public/private rsa key pair.
    Enter file in which to save the key (/your/home/.ssh/id_rsa): </your/home/.ssh/id_filetransfer>
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in id_filetransfer.
    Your public key has been saved in id_filetransfer.pub.
    [...]
Transfer the keys to the external server: The new key created in Hydra without a passphrase has to be installed on the external server as well. In this step you will have to provide your password to connect to the external server:
$ ssh-copy-id -i ~/.ssh/id_filetransfer <username>@<hostname.external.server>
Configure the connection to the external server: The specific keys used in the connection with the external server can be defined in the file ~/.ssh/config. This avoids having to explicitly set the option -i ~/.ssh/id_filetransfer on every SSH connection. Add the following lines at the bottom of your ~/.ssh/config file in Hydra (create the file if it does not exist):

    Host <hostname.external.server>
        User <username>
        IdentityFile ~/.ssh/id_filetransfer
Check the passwordless connection: At this point it should be possible to connect from Hydra to the external server with the new keys and without any prompt for a password:
$ ssh <username>@<hostname.external.server>
Automatic copy of files: Once the passwordless SSH connection is properly configured, rsync will automatically use it. You can execute the following commands in Hydra to transfer data either to or from the external server:

    $ rsync -av /path/to/source <username>@<hostname.external.server>:/path/to/destination
    $ rsync -av <username>@<hostname.external.server>:/path/to/source /path/to/destination
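To run such transfers unattended (e.g. from cron), it may help to wrap the rsync command in a small script. The sketch below is a hypothetical example; flock ensures that two instances of the transfer never run at the same time:

```shell
#!/bin/bash
# Hypothetical wrapper around the rsync transfer shown above
(
    flock -n 9 || { echo "another transfer is already running"; exit 1; }
    rsync -av --partial /path/to/source \
        <username>@<hostname.external.server>:/path/to/destination
) 9> "$HOME/.rsync-transfer.lock"
```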