Public datasets

We provide central storage for various public, freely usable datasets and databases in the shared directory /databases. The data there is accessible to all users of the HPC cluster. This central location has two advantages:

  • space saving: users do not each have to download the same dataset into their own account.

  • scientific reproducibility: ensures that different users can use the same version of the dataset.

Before downloading public data for your calculations, always check whether it is already available in /databases.
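
One quick way to make that check is to search the shared directory by name. The following is a minimal sketch, not an official cluster tool: the helper name find_public_dataset and the search depth of three levels are assumptions chosen for illustration, so adjust them to the actual layout of /databases.

```shell
# Hypothetical helper: search a shared data directory for entries whose
# name contains the given keyword (case-insensitive).
# Usage: find_public_dataset KEYWORD [ROOT]   (ROOT defaults to /databases)
find_public_dataset() {
    keyword="$1"
    root="${2:-/databases}"
    # The depth limit of 3 is an assumption to keep the search fast;
    # errors (e.g. unreadable subdirectories) are silenced.
    find "$root" -maxdepth 3 -iname "*${keyword}*" 2>/dev/null
}
```

For example, find_public_dataset pdb prints every matching path under /databases, and prints nothing if no entry matches.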

Helpdesk: We can add new databases (or update existing ones) to the HPC cluster upon request.

Protein Data Bank

The PDB database can be found in /databases/bio/PDB and is automatically updated on a weekly basis.

HuggingFace Datasets Hub

The HuggingFace Datasets Hub provides an extensive collection of machine learning (ML) datasets and models, many of which are freely accessible. Upon user request, several of these datasets have been centrally installed at /databases/huggingface. To use them, set the following environment variables before launching your scripts; HF_HOME points the HuggingFace libraries at the central cache, and HF_HUB_OFFLINE=1 keeps them from trying to reach the Hub over the network:

export HF_HOME=/databases/huggingface
export HF_HUB_OFFLINE=1
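
To see which datasets and models are present before launching a job, you can list the contents of the hub cache. The sketch below assumes the standard HuggingFace cache layout, in which repositories are stored under $HF_HOME/hub in directories named datasets--<org>--<name> or models--<org>--<name>; the central installation may be organised differently. The helper name list_hf_cache is hypothetical.

```shell
# Hypothetical helper: print cached repositories as "<type> <repo id>".
# Usage: list_hf_cache [HUB_DIR]
# HUB_DIR defaults to $HF_HOME/hub (assumed standard cache layout).
list_hf_cache() {
    hub_dir="${1:-${HF_HOME:-/databases/huggingface}/hub}"
    for entry in "$hub_dir"/datasets--* "$hub_dir"/models--*; do
        [ -d "$entry" ] || continue   # skip glob patterns that matched nothing
        name=$(basename "$entry")
        kind=${name%%--*}             # "datasets" or "models"
        repo=${name#*--}              # "org--name" or just "name"
        # Turn the "--" separator back into the "/" used in repo ids.
        printf '%s %s\n' "$kind" "$(printf '%s' "$repo" | sed 's|--|/|g')"
    done
}
```

Calling list_hf_cache with no argument inspects the central cache; pass another directory to inspect a different one.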