Public datasets#
We provide central storage for various public and freely usable datasets and
databases in the shared directory /databases. The data there is accessible to
all users of the HPC cluster. This central location has two advantages:
- Space saving: users do not each have to download the same dataset into their own account.
- Scientific reproducibility: different users can work with the same version of the dataset.
Users who run calculations on public data should always check first whether
it is already available in /databases.
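As a minimal sketch of such a check, the helper below searches the shared directory for directories whose name contains a given string. The function name `find_dataset`, the `DB_ROOT` variable, and the search depth of 3 are assumptions made for illustration; they are not part of the cluster setup.

```shell
# Root of the shared dataset storage (path from the text above).
# DB_ROOT is a hypothetical variable so the helper can point elsewhere.
DB_ROOT="${DB_ROOT:-/databases}"

# find_dataset NAME
# Print every directory under DB_ROOT (at most 3 levels deep, an arbitrary
# choice) whose name contains NAME, ignoring case.
# Errors such as "permission denied" on licensed datasets are silenced.
find_dataset() {
    find "$DB_ROOT" -maxdepth 3 -type d -iname "*$1*" 2>/dev/null
}
```

For example, `find_dataset pdb` should list /databases/bio/PDB if the directory layout matches the description in this page. Empty output means nothing matched within the chosen depth, not necessarily that the dataset is absent.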
We can add new databases (or update existing ones) on the HPC cluster upon request; contact the Helpdesk.
Protein Data Bank#
The PDB database can be found in /databases/bio/PDB and is automatically
updated on a weekly basis.
HuggingFace Datasets Hub#
The HuggingFace Datasets Hub provides
an extensive collection of machine learning (ML) datasets and models, many of
which are freely accessible. Upon user request, several of these datasets have
been centrally installed at /databases/huggingface. To use them, set the
following environment variables before launching your scripts:
export HF_HUB_CACHE=/databases/huggingface/hub
export HF_HUB_OFFLINE=1
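Before submitting a job, it can help to confirm that a model is actually present in the central cache. The sketch below assumes the cache uses the `models--<organisation>--<name>` directory naming visible in the listing further down this page. The helper names `hf_model_dir` and `hf_model_cached` are hypothetical.

```shell
# Same variables as above; HF_HUB_CACHE keeps any value already set.
export HF_HUB_CACHE="${HF_HUB_CACHE:-/databases/huggingface/hub}"
export HF_HUB_OFFLINE=1

# hf_model_dir ORG/NAME
# Print the cache directory expected for a model repository, assuming the
# "models--ORG--NAME" naming scheme.
hf_model_dir() {
    printf '%s/models--%s\n' "$HF_HUB_CACHE" "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# hf_model_cached ORG/NAME
# Succeed if that directory exists. This does not verify that the
# snapshot inside it is complete or readable.
hf_model_cached() {
    [ -d "$(hf_model_dir "$1")" ]
}

if hf_model_cached google/gemma-2-2b; then
    echo "available in the central cache"
else
    echo "not found in the central cache"
fi
```

With HF_HUB_OFFLINE=1 set, scripts will not fall back to downloading, so a model missing from the cache leads to an error rather than a silent download into your own account.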
Licensed datasets#
Some centrally installed datasets require the user to obtain a license before
use. These are accessible only to members of a specific user group. For
example, Google’s Gemma models on Hugging Face are accessible to members of the
bgemma group:
ls -hld /databases/huggingface/hub/models--[gG]oogle--[gG]emma-*
drwxr-x--- 5 vsc10001 bgemma 4.0K Jun 7 14:48 /databases/huggingface/hub/models--google--gemma-2-2b
drwxr-x--- 5 vsc10001 bgemma 4.0K Jun 7 14:57 /databases/huggingface/hub/models--google--gemma-2-2b-it
drwxr-x--- 3 vsc10001 bgemma 4.0K Apr 9 11:07 /databases/huggingface/hub/models--Google--Gemma-2b
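Since these directories are readable only by group members, a quick way to see whether you already have access is to check your own group list. This is a small sketch using the standard `id` command; the function name `in_group` is hypothetical.

```shell
# in_group GROUP
# Succeed if the current user belongs to GROUP.
# "id -nG" prints all group names separated by spaces; they are split
# onto separate lines so grep can match the whole name exactly.
in_group() {
    id -nG | tr ' ' '\n' | grep -qxF -- "$1"
}

if in_group bgemma; then
    echo "member of bgemma"
else
    echo "not a member of bgemma"
fi
```

A newly granted group membership may only become visible in a new login session.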
Follow these steps to become a member of a license user group:
1. Fill out the license form provided by the dataset copyright holder to request a license.
2. Wait until your request has been accepted.
3. Log in to the VSC account page and click the tab New/Join Group.
4. In the section Join Group, select the group under Group.
5. Under Message, write the following sentence: I have accepted the terms of the license at <link-to-license-agreement>, and my request for access has been accepted by the copyright holder.
6. Click Submit.
The table below shows a list of licensed datasets with the license user group and a link to the license form:
Datasets | Group | License form |
---|---|---|
Google’s Gemma models family | | |
Meta’s Llama 3.1 models & evals | | |
Tip
Check your dataset request status on Hugging Face.