Public datasets
We provide central storage for various public and freely usable datasets and
databases in the shared directory /databases. The data there is accessible to
all users of the HPC cluster. This central location has two advantages:
space saving: users do not each have to download the same dataset into their own account.
scientific reproducibility: different users can work with the same version of a dataset.
Users who run calculations on public data should always check first whether
it is already available in /databases.
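A quick way to make this check is to search the shared directory by name. The sketch below is only an illustration: `find_dataset` is a hypothetical helper, not a tool installed on the cluster, and the search depth of 2 is an assumption based on paths such as /databases/bio/PDB. It also relies on the `-mindepth`, `-maxdepth` and `-iname` options of GNU or BSD `find`.

```shell
# Hypothetical helper (not provided by the cluster): list entries under a
# database root whose name contains the given keyword, ignoring case.
find_dataset() (
    root=$1
    keyword=$2
    # Only search if the root exists (it will not exist off-cluster).
    if [ -d "$root" ]; then
        # Assumption: depth 2 covers layouts such as /databases/bio/PDB.
        find "$root" -mindepth 1 -maxdepth 2 -iname "*${keyword}*" -print
    fi
)

# Example: look for the Protein Data Bank before downloading it yourself.
find_dataset /databases pdb
```

On the cluster, the output should include /databases/bio/PDB. An empty result suggests that the dataset is not installed yet, or that it sits deeper than the assumed search depth.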
Helpdesk: We can add new databases to the HPC cluster, or update existing ones, upon request.
Protein Data Bank
The Protein Data Bank (PDB) is available in /databases/bio/PDB and is updated
automatically every week.
HuggingFace Datasets Hub
The HuggingFace Datasets Hub provides
an extensive collection of machine learning (ML) datasets and models, many of
which are freely accessible. Upon user request, several of these datasets have
been centrally installed at /databases/huggingface. To use them, set the
following environment variables before launching your scripts:
export HF_HOME=/databases/huggingface
export HF_HUB_OFFLINE=1
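To see what is available before writing a script, it can help to list the contents of the central cache. The following is a minimal sketch under two assumptions that are not confirmed by this page: that the central installation uses the default Hugging Face hub cache layout (a `hub` subdirectory of `HF_HOME`), and that repositories are stored in directories named `datasets--<org>--<name>` or `models--<org>--<name>`. The function `list_hf_cache` is a hypothetical helper, not part of any Hugging Face tool.

```shell
# Settings from the text above: use the central cache and stay offline.
export HF_HOME=/databases/huggingface
export HF_HUB_OFFLINE=1

# Hypothetical helper (not part of any Hugging Face tool): print the
# repositories found in a hub cache directory as type/org/name.
list_hf_cache() (
    hub_dir=$1
    for entry in "$hub_dir"/datasets--* "$hub_dir"/models--*; do
        # Skip unmatched glob patterns and anything that is not a directory.
        [ -d "$entry" ] || continue
        # Turn e.g. datasets--org--name into datasets/org/name.
        basename "$entry" | sed 's|--|/|g'
    done
)

# Assumption: the hub cache sits in the default "hub" subdirectory of HF_HOME.
list_hf_cache "$HF_HOME/hub"
```

If the listing is empty, the central installation may use a different layout than assumed here. Browsing /databases/huggingface directly is then the safer check.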