Public datasets

We provide central storage for various public and freely usable datasets and databases in the shared directory /databases. The data there is accessible to all users of the HPC cluster. This central location has two advantages:

  • space saving: avoids many users each downloading the same dataset into their own account.

  • scientific reproducibility: ensures that different users can use the same version of the dataset.

Users who run calculations on public data should always first check whether that data is already available in /databases.
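
One way to do this check from the command line is sketched below. The helper name find_dataset is hypothetical, and the search depth of 3 is only an assumption about how /databases is organized; adjust both as needed.

```shell
# Hypothetical helper: search a dataset root (default: /databases) for
# directories whose name contains the given keyword, ignoring case.
find_dataset() {
    keyword="$1"
    root="${2:-/databases}"
    # Limit the depth so the search stays fast on large directory trees
    # (the value 3 is an assumption, not a documented layout).
    find "$root" -maxdepth 3 -type d -iname "*${keyword}*" 2>/dev/null
}

# Example: find_dataset pdb
```

If the helper prints nothing, the dataset is probably not installed centrally under a name containing that keyword.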

Helpdesk: We can add new databases to the HPC cluster, or update existing ones, upon request.

Protein Data Bank

The PDB database can be found in /databases/bio/PDB and is automatically updated on a weekly basis.

Hugging Face Datasets Hub

The Hugging Face Datasets Hub provides an extensive collection of machine learning (ML) datasets and models, many of which are freely accessible. Upon user request, several of these datasets have been centrally installed at /databases/huggingface. To use them, set the following environment variables before launching your scripts. The first points the Hugging Face libraries at the central cache, and the second stops them from contacting the Hub to download files:

export HF_HUB_CACHE=/databases/huggingface/hub
export HF_HUB_OFFLINE=1
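
Because offline mode disables downloads, it can help to confirm that a model is present in the central cache before starting a job. The sketch below assumes the standard hub cache layout, in which a model repository is stored in a directory named models--<org>--<name>. The helper name hf_model_cached and the script name my_script.py are hypothetical.

```shell
# Point the Hugging Face libraries at the central cache and disable downloads.
export HF_HUB_CACHE=/databases/huggingface/hub
export HF_HUB_OFFLINE=1

# Hypothetical helper: succeed if the given model repository
# (e.g. "google/gemma-2-2b") has a directory in the cache.
# Assumes the cache layout "models--<org>--<name>".
hf_model_cached() {
    repo_dir="models--$(printf '%s' "$1" | sed 's|/|--|g')"
    [ -d "${HF_HUB_CACHE}/${repo_dir}" ]
}

# Example (my_script.py is a placeholder for your own script):
#   hf_model_cached google/gemma-2-2b && python my_script.py
```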

Licensed datasets

Some centrally installed datasets require the user to obtain a license before use. These datasets are only accessible to members of a specific user group. For example, Google’s Gemma models on Hugging Face are accessible to members of the bgemma group:

ls -hld /databases/huggingface/hub/models--[gG]oogle--[gG]emma-*
drwxr-x--- 5 vsc10001 bgemma 4.0K Jun  7 14:48 /databases/huggingface/hub/models--google--gemma-2-2b
drwxr-x--- 5 vsc10001 bgemma 4.0K Jun  7 14:57 /databases/huggingface/hub/models--google--gemma-2-2b-it
drwxr-x--- 3 vsc10001 bgemma 4.0K Apr  9 11:07 /databases/huggingface/hub/models--Google--Gemma-2b
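
To see whether your own account already belongs to such a group, a check along the following lines can be used. The helper name in_group is hypothetical; it relies only on the standard id, tr, and grep utilities.

```shell
# Hypothetical helper: succeed if the current user belongs to the given group.
in_group() {
    # id -nG prints the names of all groups of the current user on one line;
    # split them onto separate lines and look for an exact, literal match.
    id -nG | tr ' ' '\n' | grep -qxF "$1"
}

# Example: in_group bgemma && echo "member of bgemma"
```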

Here are the steps to become a member of a license user group:

  1. Fill out the license form provided by the dataset copyright holder to request/obtain a license.

  2. Wait until your request has been accepted.

  3. Log in to the VSC account page and click the New/Join Group tab.

  4. In the Join Group section, select the group under Group.

  5. Under Message, write the following sentence:

    I have accepted the terms of the license at <link-to-license-agreement>, and my request for access has been accepted by the copyright holder.

  6. Click Submit.

The table below shows a list of licensed datasets with the license user group and a link to the license form:

Datasets                           Group        License form
Google’s Gemma models family       bgemma       https://huggingface.co/google/gemma-2-2b
Meta’s Llama 3.1 models & evals    bllama3_1    https://huggingface.co/meta-llama/Llama-3.1-8B

Tip: Check your dataset request status on Hugging Face.