3. Specific Use Cases

3.1. AlphaFold

AlphaFold runs in three main steps:

  1. multiple sequence alignment (MSA) using the AlphaFold databases (very I/O intensive)

  2. inference step: runs the pretrained neural network on GPU

  3. a short OpenMM minimization of the predicted structure

AlphaFold is installed for the Nvidia Ampere GPUs in Hydra. Submit your jobs with the options --gpus-per-node=1 --partition=ampere_gpu to specifically request one of those GPUs.

Example job script for AlphaFold
#!/bin/bash
#SBATCH --partition=ampere_gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=16

export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-pipe
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-mps-log
nvidia-cuda-mps-control -d

module load AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0

run_alphafold.py \
    --model_preset=monomer_casp14 \
    --fasta_paths=<protein_sequence>.fasta \
    --max_template_date=<YYYY-MM-DD> \
    --output_dir=$SLURM_SUBMIT_DIR

Tip

For a protein already available in the databases used by AlphaFold, set max_template_date to a date prior to its publication date to avoid using its experimental structure as a template.

You only need the sequence file (FASTA format) of the protein for your job. We provide the full datasets for AlphaFold in a central location accessible to all users. The path to those datasets is /databases/bio/alphafold-<version>, where <version> is any of the installed versions of AlphaFold. The software modules of AlphaFold are configured to use the datasets in this shared central location by default.

3.1.1. Using local SSD storage for large datasets

The datasets in /databases can be used directly in your jobs. However, the shared storage can still be a bottleneck because the MSA step of AlphaFold performs a large number of random reads. We have a caching system in place that greatly improves access times, but it only benefits series of runs (in a single job or across multiple jobs) that use the same DBs and are executed on the same node.

Hydra has several ampere_gpu nodes specially suited to run AlphaFold. These nodes have a very fast local scratch storage made of SSD drives with a total capacity of 5.9 TB. Therefore, the performance of the MSA step can be greatly improved by copying the DBs to the local $TMPDIR of the compute node running your job. Follow these steps to leverage this fast SSD storage in your AlphaFold jobs:

  1. Submit your jobs specifically to nodes with a big SSD storage using the sbatch option --constraint="big_local_ssd"

  2. In your job, first copy the DBs to your job’s temporary folder $TMPDIR. This folder is located in the big SSD storage. We highly recommend using fpsync for this task, which is a parallel rsync tool

    fpsync -n 8 $ALPHAFOLD_DATA_DIR $TMPDIR/alphafold-data-dir
    
  3. Once the selected DBs are in the temporary storage, you can change the location of AlphaFold’s data dir by updating the $ALPHAFOLD_DATA_DIR (environment) variable after having loaded the AlphaFold module

    export ALPHAFOLD_DATA_DIR="$TMPDIR/alphafold-data-dir"
    
Example AlphaFold job script with data dir in big local SSD storage
#!/bin/bash
#SBATCH --partition=ampere_gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=16
#SBATCH --constraint="big_local_ssd"

export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-pipe
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-mps-log
nvidia-cuda-mps-control -d

module load AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
# copy the DBs in parallel (defaults to 8 workers if SLURM_CPUS_PER_TASK is not set)
fpsync -n ${SLURM_CPUS_PER_TASK:-8} $ALPHAFOLD_DATA_DIR $TMPDIR/alphafold-data-dir
export ALPHAFOLD_DATA_DIR="$TMPDIR/alphafold-data-dir"

run_alphafold.py \
    --model_preset=monomer_casp14 \
    --fasta_paths=<protein_sequence>.fasta \
    --max_template_date=<YYYY-MM-DD> \
    --output_dir=$SLURM_SUBMIT_DIR

3.2. Ansys Fluent

Note

VUB-HPC does not provide licenses for Ansys Fluent. Research departments can bring their own license to use this software in our HPC clusters.

Ansys Fluent has its own MPI implementations and does not need any additional software module beyond loading a FLUENT/XXXX module to work in the HPC.

Benchmarks in our Tier-2 cluster Hydra showed that the best performance in Fluent is achieved with its default MPI implementation ibmmpi. However, results may vary depending on the characteristics of your simulation. Hence, it is always recommended to run your own benchmarks with your own simulations before executing the production runs in the cluster.

The following is an example job script to run Fluent simulations in parallel with MPI using the nodes in Hydra with fast InfiniBand network:

Example job script to run Fluent in parallel with MPI
#!/bin/bash
#SBATCH --job-name=FluentRun
#SBATCH --time=72:00:00
#SBATCH --ntasks=64
#SBATCH --partition=skylake_mpi

# generate nodelist file
nodelist=$(scontrol show hostname $SLURM_NODELIST)
printf "%s\n" "${nodelist[@]}" > nodefile

module load FLUENT/2023R1

fluent 2ddp -g -t$SLURM_NTASKS -cnf=nodefile -pinfiniband -platform=intel -cflush < job.input > job.output

rm nodefile

3.3. CESM/CIME

The dependencies required to run CESM in Hydra are provided by the module CESM-deps. This module also contains the XML configuration files for CESM with the specification of machines, compiler and batch system of Hydra. Once CESM-deps is loaded, the configuration files can be found in ${EBROOTCESMMINDEPS}/machines.

The following steps show an example setup of a CESM/CIME case

  1. Load the module CESM-deps

    module load CESM-deps/2-foss-2022a
    
  2. All data files for CESM have to be placed in $VSC_SCRATCH/cesm (DIN_LOC_ROOT)

    mkdir $VSC_SCRATCH/cesm
    

    Users that want to store their data elsewhere (e.g. in a Virtual Organization) can link that location to the cesm folder in their $VSC_SCRATCH

    Link CESM data folder to your VO
    ln -sf $VSC_SCRATCH_VO_USER/cesm $VSC_SCRATCH/cesm
    
  3. Create the following folder structure for your CESM cases in $VSC_SCRATCH

    mkdir $VSC_SCRATCH/cesm/cases
    mkdir $VSC_SCRATCH/cesm/output
    mkdir $VSC_SCRATCH/cesm/sources
    
  4. An extensive input dataset for CESM is already available to all users in Hydra at /databases/climate/cesm/inputdata. Link that folder to your cesm directory

    ln -sf /databases/climate/cesm/inputdata $VSC_SCRATCH/cesm/inputdata
    
  5. Download the source code of CESM/CIME into $VSC_SCRATCH/cesm/sources:

    Clone a public release of CESM
    cd $VSC_SCRATCH/cesm/sources
    git clone -b release-cesm2.2.2 https://github.com/ESCOMP/cesm.git cesm-2.2.2
    
    Clone external sources of CIME
    cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2
    ./manage_externals/checkout_externals
    
  6. Add the configuration settings for Hydra and Breniac to your CESM/CIME source code

    cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2
    update-cesm-machines cime/config/cesm/machines/ $EBROOTCESMMINDEPS/machines/
    
  7. (Optional) Add support for iRODS

    Determine your version of CIME
    $ cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2
    $ git -C cime/ describe --tags
    cime5.8.32
    
    Apply the patches for the closest CIME version in $EBROOTCESMMINDEPS/irods
    $ cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2
    $ git apply $EBROOTCESMMINDEPS/irods/cime-5.8.32/*.patch
    
  8. The creation of a case follows the usual procedure for CESM. Create your case inside $VSC_SCRATCH/cesm/cases

    cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2/cime/scripts
    ./create_newcase --case $VSC_SCRATCH/cesm/cases/name_of_case --res f19_g17 --compset I2000Clm50BgcCrop --compiler gnu
    
  9. Your CESM case can now be set up, built and launched. You can do so manually from the login node using the standard procedure with the case.setup, case.build and case.submit scripts in the case folder.

    We also provide a job script called case.slurm to automatically perform all these steps in one go in the compute nodes. This approach minimizes wait times in the queue and ensures that the nodes building and running the case are compatible. You can copy the template in $EBROOTCESMMINDEPS/scripts/case.slurm to your case and modify it as needed (adding xmlchange or any other commands). Once the script is adapted to your needs, submit it to the queue with sbatch as any other job:

    Copy job template to your case folder
    cd $VSC_SCRATCH/cesm/cases/name_of_case
    cp $EBROOTCESMMINDEPS/scripts/case.slurm $VSC_SCRATCH/cesm/cases/name_of_case/
    
    Edit case.slurm if needed
    $EDITOR case.slurm
    
    Submit your CESM job (adjust computational resources as needed)
    sbatch --ntasks=40 --time=24:00:00 case.slurm
    

The module CESM-tools provides a set of tools commonly used to analyse and visualize CESM data. Nonetheless, CESM-tools cannot be loaded at the same time as CESM-deps because their packages have incompatible dependencies. Once you obtain the results of your case, unload any modules with module purge and load CESM-tools/2-foss-2019a to post-process the data of your case.

3.4. ColabFold

ColabFold provides several notebooks that can be used on notebooks.hpc.vub.be. Once the ColabFold software module is loaded in the software module panel of JupyterLab, you can find the example notebooks in the folder $EBROOTCOLABFOLD/notebooks. That folder is read-only though, so make sure to copy any notebook file to your working directory of choice.

Alternatively, ColabFold can also be used in batch mode through a regular job script. The following is an example job script to generate multiple sequence alignments (MSAs) from a given FASTA file and make folding predictions of its structure on a GPU.

Example ColabFold job script using MMseqs2
#!/bin/bash
#SBATCH --job-name=5AWL_1
#SBATCH --output="%x-%j.out"
#SBATCH --time=24:00:00
#SBATCH --partition=ampere_gpu
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=1
#SBATCH --mem=128G

module load ColabFold/1.5.2-foss-2022a-CUDA-11.7.0
module load MMseqs2/14-7e284-gompi-2022a

COLABFOLD_DATA="/databases/bio/colabfold-1.5.2"

colabfold_search --threads $SLURM_CPUS_PER_TASK 5AWL_1.fasta $COLABFOLD_DATA 5AWL_1_msa
colabfold_batch --amber --use-gpu-relax 5AWL_1_msa 5AWL_1_models

The file 5AWL_1.fasta is an example input file available in the repository of ColabFold.

3.5. CP2K

CP2K usually runs with best performance by distributing the workload of the simulation across MPI tasks with 1 CPU core per task, following the standard approach of parallel MPI jobs. In this case, it is important to disable multi-threading by setting the environment variable OMP_NUM_THREADS=1.

Example CP2K job script using 8 MPI tasks with 1 core per task
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --ntasks=8

module load CP2K/9.1-foss-2022a
export OMP_NUM_THREADS=1

srun cp2k.popt -i example.inp -o example.out

However, it is not uncommon to hit memory leaks in CP2K once the simulation size scales up (see issues cp2k#1830 or cp2k#2657). Depending on the version of CP2K and the underlying MPI implementation, the aforementioned approach with a pure MPI job might not work. You can identify such an issue if you see segmentation fault errors in the output of your job containing corrupted size vs. prev_size or corrupted double-linked list in the error message. In such a case, the alternative approach is to use a hybrid job, combining MPI tasks and threads. In the following example a simulation is run in 80 CPU cores, using 40 MPI tasks with 2 cores per task:

Example CP2K job script using 80 CPU cores, with 40 MPI tasks and 2 cores per task
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=2

module load CP2K/9.1-foss-2022a

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

srun cp2k.psmp -i example.inp -o example.out

Increasing the number of CPU cores per MPI task is detrimental to the performance of CP2K though. Hence, the number of cores per task should be kept as low as possible, using a pure MPI job when possible. In our tests, the CPU time of each MD step increased by 50% by going from 2 to 20 cores per task.

Performance of cores per task distribution in test simulation using CP2K/8.2-foss-2021a with 80 cores in total

Tasks   Cores per task   CPU time of MD step (s)
-----   --------------   -----------------------
40      2                3.9
16      5                4.3
8       10               4.8
4       20               5.8
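
The relative cost of each distribution follows directly from the timings in the table above. As a small sketch (the function name relative_increase is ours and purely illustrative, it is not part of CP2K or Hydra), the percentage increase in CPU time with respect to the baseline of 2 cores per task can be computed in the shell:

```shell
# Hypothetical helper (not part of CP2K or Hydra): percentage increase of a
# timing with respect to a baseline, rounded to the nearest integer.
relative_increase() {
    # $1: baseline CPU time (s), $2: measured CPU time (s)
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.0f\n", (t - base) / base * 100 }'
}

# Timings from the table above; the baseline is 2 cores per task (3.9 s)
for t in 4.3 4.8 5.8; do
    echo "$t s: +$(relative_increase 3.9 "$t")%"
done
```

The loop reports +10%, +23% and +49%. The last value is the increase of roughly 50% observed when going from 2 to 20 cores per task.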

3.6. GAMESS-US

GAMESS-US has its own command tool rungms to launch its simulations in parallel. Using mpirun or srun is not needed.

Example job script for GAMESS-US
#!/bin/bash
#SBATCH --time=6:00:00
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=2

module load GAMESS-US/20220930-R2-gompi-2022a

rungms <your_project>.inp $EBVERSIONGAMESSMINUS $SLURM_NTASKS ${SLURM_NTASKS_PER_NODE:-1} > <your_project>.out

This example job will request 2 nodes with 4 CPU cores per node. If you need more cores, increase the number of nodes (preferably) and/or the number of tasks per node. The job script will automatically pass those values to rungms with the $SLURM_ variables. Replace <your_project> with the name of your project files.

GAMESS-US is configured to use the storage in your $VSC_SCRATCH. All the files generated by GAMESS-US will be located in specific folders for each of your jobs.

3.7. GAP

The GAP shell has a strong focus on being used interactively, whereas on Hydra the preferred way to run calculations is by submitting job scripts. Nonetheless, it is possible to use the interactive shell of GAP in our compute nodes with the following steps

  1. Request an interactive job session and wait for it to be allocated:

    $ srun --cpus-per-task=4 --time=3:0:0 --pty bash -l
    srun: job <jobID> queued and waiting for resources
    srun: job <jobID> has been allocated resources
    vsc10xxx@node361 ~ $
    
  2. Load the module of GAP and start its shell as usual:

    vsc10xxx@node361 ~ $ module load gap/4.11.0-foss-2019a
    vsc10xxx@node361 ~ $ gap
    ********* GAP 4.11.0 of 29-Feb-2020
    * GAP * https://www.gap-system.org
    ********* Architecture: x86_64-pc-linux-gnu-default64-kv7
    [...]
    gap>
    

Submitting a job script using GAP is also possible and requires preparing two scripts. One is the usual job script to be submitted to the queue and the second one is the script with the commands for GAP.

  • The job script is a standard job script requesting the resources needed by your calculation, loading the required modules and executing the script with the code for GAP.

    Example job script for GAP
    #!/bin/bash
    #SBATCH --time=60:0
    #SBATCH --cpus-per-task=4

    module load gap/4.11.0-foss-2019a-modisomTob

    ./gap-script.sh
    
  • The script gap-script.sh is a shell script that executes GAP and passes your code to it. It is necessary to execute GAP with the -A option and only load the required GAP packages at the beginning of your script to avoid issues.

    Example script to execute GAP in batch mode
    #!/bin/bash
    gap -A -r -b -q << EOI
    LoadPackage( "Example" );
    2+2;
    EOI
    

    Important

    Remember to make gap-script.sh executable with the command chmod +x gap-script.sh

3.8. Gaussian

The available modules for Gaussian can be listed with the command:

module --show-hidden spider Gaussian

We recommend using the module Gaussian/G16.A.03-intel-2017b for general use because its performance has been thoroughly optimized for Hydra. More recent modules, such as Gaussian/G16.B.01, should be used if you need any of their new features.

Gaussian jobs can use significantly more memory than the value specified by %mem in the input file or with g16 -m in the execution command. Therefore, it is recommended to submit Gaussian jobs requesting a total memory that is at least 20% larger than the memory value defined in the calculation.

Gaussian G16 should automatically manage the available resources and parallelization. However, it is known to under-perform in some circumstances and not use all cores allocated to your job. In Hydra, the command myresources will report the actual use of resources of your jobs. If any of your Gaussian calculations is not using all available cores, it is possible to force the total number of cores used by Gaussian G16 with the option g16 -p or by adding the Gaussian directive %nprocshared to the top of the input file.

The following job script is an example to be used for Gaussian calculations running on 1 node with multiple cores. In this case we are running a g16 calculation with 80GB of memory (-m=80GB), but requesting a total of 20 * 5GB = 100GB of memory (25% more). Additionally, we are requesting 20 cores for this job and automatically passing this setting to g16 with the option -p=${SLURM_CPUS_PER_TASK:-1}, where ${SLURM_CPUS_PER_TASK} is an environment variable that contains the number of cores allocated to your job.

#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=5GB

ml Gaussian/G16.A.03-intel-2017b

g16 -p=${SLURM_CPUS_PER_TASK:-1} -m=80GB < input_file.com > output_file.log
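
The 20% rule described above can be turned into a quick calculation. The following sketch uses a hypothetical helper, gaussian_mem_per_cpu (our name, not a Gaussian or Slurm tool), to derive a value for --mem-per-cpu from the memory given to g16 and the number of cores of the job, rounding up to whole GB:

```shell
# Hypothetical helper: smallest whole number of GB per core such that the
# total job memory is at least 20% larger than the memory passed to g16 -m.
gaussian_mem_per_cpu() {
    # $1: memory given to Gaussian in GB, $2: number of cores of the job
    # integer ceiling of ($1 * 1.2 / $2)
    echo $(( ($1 * 12 + $2 * 10 - 1) / ($2 * 10) ))
}

gaussian_mem_per_cpu 80 20   # prints 5, the value used in the job script above
```

With 80GB for g16 and 20 cores, the minimum total is 96GB, so 5GB per core (100GB in total) satisfies the rule.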

3.9. GaussView

GaussView is a graphical interface used with the computational chemistry program Gaussian. GaussView is installed in Hydra and can be used alongside Gaussian to enable all property visualizations.

  1. Login to Hydra enabling X11 forwarding. Linux and macOS users can do so by adding the option -Y to the ssh command used for login. See below:

    ssh -Y <username>@login.hpc.vub.be
    
  2. Load the modules of GaussView

    • GaussView 6 with Gaussian/G16.A.03:

      module load GaussView/6.0.16
      
    • GaussView 6 with Gaussian/G16.B.01:

      module load Gaussian/G16.B.01
      module load GaussView/6.0.16
      
  3. Launch GaussView:

    gview
    

Note

If the GaussView interface in Hydra is too slow for you, the HPC team recommends installing GaussView on your personal computer. Binary packages of GaussView are available for Linux, Mac, and Windows on softweb.

3.10. GROMACS

3.10.1. Threading Models

GROMACS supports two threading models, which can be used together:

  • OpenMP threads

  • thread-MPI threads: MPI-based threading model implemented as part of GROMACS, incompatible with process-based MPI models such as OpenMPI

There are two variants of the GROMACS executable:

  • gmx: recommended for all single-node jobs, supports both OpenMP threads and thread-MPI threads

  • gmx_mpi: for multi-node jobs, must be used with srun, only supports OpenMP threads

The number of threads must always be specified, as GROMACS sets it incorrectly on Hydra:

  • gmx: use option -nt to let GROMACS determine optimal numbers of OpenMP threads and thread-MPI threads

  • gmx_mpi: use option -ntomp (not -ntmpi or -nt), and set number of threads equal to 1

When running on 1 or more GPUs, GROMACS will by default:

  • detect the number of available GPUs, create 1 thread-MPI thread for each GPU, and evenly divide the available CPU cores between the GPUs using OpenMP threads. Therefore, --cpus-per-task should be a multiple of the number of GPUs. Always check in the log file that the correct number of GPUs is indeed detected.

  • optimally partition the force field terms between the GPU(s) and the CPU cores, depending on the number of GPUs and CPU cores and their respective performances.
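
Since the CPU cores are divided evenly between the GPUs, it can help to verify the requested resources before launching mdrun. A minimal sketch, assuming a hypothetical helper cores_per_gpu (not a GROMACS or Slurm command) that takes the number of cores and the number of GPUs of the job:

```shell
# Hypothetical helper: number of OpenMP threads each GPU would get, or an
# error if the cores cannot be divided evenly between the GPUs.
cores_per_gpu() {
    # $1: value of --cpus-per-task, $2: number of GPUs in the job
    if [ "$2" -lt 1 ] || [ $(( $1 % $2 )) -ne 0 ]; then
        echo "cpus-per-task ($1) is not a multiple of the number of GPUs ($2)" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

cores_per_gpu 8 2   # prints 4: one thread-MPI thread per GPU, 4 OpenMP threads each
```

A request such as 7 cores for 2 GPUs makes the function return an error instead of a value, which signals that --cpus-per-task should be adjusted.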

3.10.2. Job Scripts

To get good parallel performance, GROMACS must be launched differently depending on the requested resources (#nodes, #cores, and #GPUs). In the example job scripts given below, a molecular dynamics simulation is launched with run input file example.tpr:

  • single-node, multi-core

    #!/bin/bash
    #SBATCH --time=1:0:0
    #SBATCH --cpus-per-task=4

    module load GROMACS/2020.4-foss-2020a-Python-3.8.2

    gmx mdrun -nt ${SLURM_CPUS_PER_TASK:-1} -s example.tpr
    
  • multi-node

    #!/bin/bash
    #SBATCH --time=1:0:0
    #SBATCH --ntasks=8

    module load GROMACS/2020.4-foss-2020a-Python-3.8.2

    srun gmx_mpi mdrun -ntomp 1 -s example.tpr
    
  • single-GPU, single-node, multi-core

    #!/bin/bash
    #SBATCH --time=1:0:0
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    #SBATCH --cpus-per-task=4

    module load GROMACS/2019.3-fosscuda-2019a

    gmx mdrun -nt ${SLURM_CPUS_PER_TASK:-1} -s example.tpr
    
  • multi-GPU, single-node, multi-core

    #!/bin/bash
    #SBATCH --time=1:0:0
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=2
    #SBATCH --cpus-per-task=8

    module load GROMACS/2019.3-fosscuda-2019a

    gmx mdrun -nt ${SLURM_CPUS_PER_TASK:-1} -s example.tpr
    

See also

The chapter Getting good performance from mdrun in the GROMACS manual for more information on running GROMACS efficiently.

3.11. MATLAB

Note

MATLAB is available on the Tier-2 clusters of VUB to all students and research staff affiliated with VUB.

MATLAB is a popular programming language for numeric computing. It is a convenient tool to learn and triage mathematical models and algorithms. However, it is not recommended for compute intensive tasks as its performance is hindered by its distribution model, which provides limited optimization for the hardware of the cluster and limited parallel execution capabilities.

3.11.1. Graphical interface

MATLAB is available in our notebook platform. You can launch MATLAB in your browser with the same graphical interface as its desktop application and run simulations on the HPC.

3.11.2. Shell interface

MATLAB provides its own shell environment to run commands interactively. The MATLAB shell is available on any terminal interface of our clusters.

  1. Check the available MATLAB versions:

    module spider MATLAB
    
  2. Load a suitable version (pick the most recent one in case of doubt):

    module load MATLAB/2024a-r6
    
  3. Launch the MATLAB shell:

    matlab -nodisplay
    

3.11.3. Batch mode

The main method to run your MATLAB code on the HPC is in batch mode. This requires preparing a job script that can run unattended on the cluster and execute some MATLAB .m script.

Running MATLAB in batch mode to execute some .m script is done with the -batch option. For example, you can run a MATLAB script called testmatlab.m with:

matlab -batch "run('testmatlab.m');"

Recommended: Directly executing matlab -batch is fine for testing, but it can be slow and it is limited by the number of active seats provided by our license. Instead, we recommend first compiling your .m script with the MATLAB compiler (mcc), which can be done directly on the terminal interface:

mcc -m testmatlab.m

The compilation with mcc will generate a testmatlab binary file, as well as an executable run_testmatlab.sh shell script (and a few other files, which you can ignore). The shell script makes sure the environment is set correctly before executing the binary with your script.

Now you can submit your MATLAB calculation as a batch job. Prepare a job script similar to the one below that loads some MATLAB module and executes the run_testmatlab.sh generated by mcc:

Basic single core MATLAB batch script
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1

module load MATLAB/2024a-r6

./run_testmatlab.sh $EBROOTMATLAB &> testmatlab.out

This job script can be submitted to the job scheduler with the command sbatch as any other serial job.

Tip

MATLAB only uses 1 CPU core by default. Running MATLAB in parallel on multiple cores requires special care. See the guide MATLAB Multicore for more information.

3.12. matplotlib

The HPC environment is optimized for the execution of non-interactive applications in job scripts. Therefore, matplotlib is configured with a non-GUI backend (Agg) that can save the resulting plots in a variety of image file formats. The generated image files can be copied to your own computer for visualization or further editing.

If you need to work interactively with matplotlib and visualize its output from within Hydra, you can do so with the following steps

  1. Login to Hydra enabling X11 forwarding. Linux and macOS users can use the following command:

    ssh -Y username@login.hpc.vub.be
    
  2. Enable the TkAgg backend at the very beginning of your Python script:

    import matplotlib
    matplotlib.use('TkAgg')
    

    Note

    The function matplotlib.use() must be called before importing matplotlib.pyplot. Changing the backend parameter in your matplotlibrc file will not have any effect as the system-wide configuration takes precedence over it.

3.13. OpenFold

OpenFold is a trainable PyTorch-based implementation of AlphaFold.

OpenFold is installed for the Nvidia Ampere GPUs in Hydra. Submit your jobs with the options --gpus-per-node=1 --partition=ampere_gpu to specifically request one of those GPUs. If more memory is required for your job, increase the number of CPU-cores with --cpus-per-gpu (up to 16 cores per GPU).

Example job script for OpenFold
#!/bin/bash
#SBATCH --partition=ampere_gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=8

module load OpenFold/1.0.1-foss-2021a-CUDA-11.3.1

# AlphaFold databases location:
ALPHAFOLD_DATA_DIR=/databases/bio/alphafold-2.2.0

# OpenFold parameters location:
OPENFOLD_PARAMETERS_DIR=/databases/bio/openfold-1.0.1/openfold_params/

# --cpus follows the cores requested with --cpus-per-gpu (8 in this job)
run_pretrained_openfold.py \
    fasta_dir \
    $ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files/ \
    --uniref90_database_path $ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
    --mgnify_database_path $ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path $ALPHAFOLD_DATA_DIR/pdb70/pdb70 \
    --uniclust30_database_path $ALPHAFOLD_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --output_dir ./OUTPUT \
    --bfd_database_path $ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device "cuda:0" \
    --jackhmmer_binary_path $EBROOTHMMER/bin/jackhmmer \
    --hhblits_binary_path $EBROOTHHMINSUITE/bin/hhblits \
    --hhsearch_binary_path $EBROOTHHMINSUITE/bin/hhsearch \
    --kalign_binary_path $EBROOTKALIGN/bin/kalign \
    --cpus=${SLURM_CPUS_PER_GPU:-8} \
    --config_preset "model_1_ptm" \
    --openfold_checkpoint_path $OPENFOLD_PARAMETERS_DIR/finetuning_ptm_2.pt

You only need a directory (fasta_dir in example above) that contains one or more sequence files (FASTA format) of the protein for your job.

We provide the full AlphaFold datasets in a shared storage accessible to all users. The path to those datasets is /databases/bio/alphafold-<version>, where <version> is any of the installed versions of AlphaFold. The path is defined in the example job script by the variable ALPHAFOLD_DATA_DIR.

The OpenFold parameters are found in /databases/bio/openfold-<version>/openfold_params/, where <version> is any of the installed versions of OpenFold. The path is defined in the example job script by the variable $OPENFOLD_PARAMETERS_DIR.

As explained in the AlphaFold sub-section on Using local SSD storage for large datasets, performance of the MSA step can be greatly improved by copying the datasets to the local scratch:

Copy AlphaFold data dir to local scratch
#SBATCH --constraint="big_local_ssd"

module load OpenFold/1.0.1-foss-2021a-CUDA-11.3.1
fpsync -n 12 $ALPHAFOLD_DATA_DIR $TMPDIR/alphafold-data-dir
ALPHAFOLD_DATA_DIR="$TMPDIR/alphafold-data-dir"

3.14. ORCA

Using ORCA in parallel has the particularity of having to execute orca including the full path of the executable. This can be easily achieved by relying on the $EBROOTORCA environment variable, which always points to the installation directory of ORCA. Additionally, ORCA handles MPI on its own, so it is not necessary to explicitly use mpirun or srun with it.

Example job script for ORCA
#!/bin/bash
#SBATCH --time=3:00:00
#SBATCH --ntasks=8

module load ORCA/5.0.1-gompi-2021a

$EBROOTORCA/bin/orca example.inp

The previous example uses 8 parallel tasks (--ntasks=8). Keep in mind to always define the number of processors in ORCA (%PAL NPROCS 8 END) to be the number of tasks in your job.
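
A mismatch between the two values is easy to overlook. The sketch below shows one possible check to run in the job script before launching ORCA. The helper names orca_nprocs and check_orca_nprocs are hypothetical, and the sketch assumes that the input file sets the number of processes through a %PAL block with an NPROCS entry (it does not handle any other way of requesting parallel execution):

```shell
# Hypothetical helper: print the NPROCS value of the first %PAL block
# found in an ORCA input file (case-insensitive match).
orca_nprocs() {
    grep -i -o 'nprocs[[:space:]]*[0-9][0-9]*' "$1" | head -n 1 | grep -o '[0-9][0-9]*'
}

# Hypothetical helper: fail if the input file and the job disagree.
check_orca_nprocs() {
    # $1: ORCA input file, $2: number of tasks of the job (e.g. $SLURM_NTASKS)
    if [ "$(orca_nprocs "$1")" != "$2" ]; then
        echo "NPROCS in $1 does not match the $2 tasks of the job" >&2
        return 1
    fi
}
```

In a job script this could be called as check_orca_nprocs example.inp "$SLURM_NTASKS" || exit 1 right before the orca command.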

Warning

ORCA v5.0.0 has a bug that causes parallel jobs using exactly 1 task per node to fail. Use version 5.0.1 to solve this issue.

3.15. PyTorch

PyTorch internally uses threads to run in parallel, but it makes some assumptions to determine the number of threads that do not apply to Hydra. This usually results in jobs with too many threads, saturating the allocated cores and hindering its performance. For optimal performance, the total number of threads should be equal to the number of cores allocated to your job. PyTorch can be configured to follow this rule by adding the following lines near the beginning of your Python script:

import os
import torch
torch.set_num_threads(len(os.sched_getaffinity(0)))
torch.set_num_interop_threads(1)

Note

If you are using Python multiprocessing on top of PyTorch, then your job will generate N threads for each PyTorch process (P), resulting in a total number of threads N × P. In this case, you should adapt your job to make sure that the number of requested cores is equal to the total number of threads (N × P).

3.15.1. Distributed Training

PyTorch can leverage multiple GPUs across multiple nodes thanks to DistributedDataParallel, which is a pure PyTorch implementation (without third-party libraries) that provides data parallelism. This approach is based on synchronizing gradients across each model replica on the GPUs, which means that each GPU runs an identical copy of the model.

See also

Since v1.11, PyTorch also provides a more advanced approach called Fully Sharded Data Parallel (FSDP). FSDP makes it possible to run distributed trainings without copying the full model parameters to each GPU: the model is cut into smaller shards that are distributed over the GPUs handling that part of the training. This results in an overall smaller memory footprint, which enables the training of even larger models on existing hardware. We recommend combining FSDP with Hugging Face Accelerate to easily set up, launch and run distributed trainings.

Running a distributed training with DistributedDataParallel needs specific adaptations, not only in your job script but also in your training scripts.

The job should only request 1 task per GPU. The following is taken from the documentation of PyTorch Distributed:

To use DistributedDataParallel on a host with N GPUs, you should spawn up N processes, ensuring that each process exclusively works on a single GPU from 0 to N-1.

Define --gpus-per-node as needed, e.g. 2 GPUs per node is the standard in Hydra. Set the number of tasks with --ntasks-per-node to guarantee a correct distribution of tasks across the GPUs in the different nodes. Both options should always be set to the same value to ensure that the job uses 1 task per GPU.

Example job script for PyTorch with 4 GPUs in 2 nodes
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --partition=ampere_gpu

module load PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1

master_host=$(scontrol show hostname $SLURM_NODELIST | head -n1)
master_port=$(( 29500 + ($SLURM_JOB_ID % 10000) ))

srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc-per-node=$SLURM_NTASKS_PER_NODE \
    --max-restarts=3 \
    --rdzv-id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint=$master_host:$master_port \
    your_training_script.py

Your PyTorch script has to be launched in your job with srun. This works in the same way as any other MPI job. srun will initialize the MPI stack, set up the network connection between the nodes and start all the tasks on each GPU.

PyTorch provides a custom utility called torchrun specifically designed to help launch multi-node trainings. The example above executes torchrun with the same number of nodes and tasks requested by the job and also automatically sets up the communication backend (so-called rendezvous) by defining a unique ID and communication port based on the job’s ID. The same code will work on all your jobs without further modifications.
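
The port calculation used in the job script can be looked at in isolation. The sketch below wraps it in a hypothetical function, rendezvous_port (our name, not a PyTorch or Slurm command), to show that the result always stays within a fixed range of 10000 ports starting at 29500, regardless of the job ID:

```shell
# Hypothetical helper reproducing the port formula of the job script above
rendezvous_port() {
    # $1: Slurm job ID
    echo $(( 29500 + ($1 % 10000) ))
}

rendezvous_port 1234567   # prints 34067
```

Two jobs only get the same port if their IDs differ by a multiple of 10000, and every resulting port (29500 to 39499) is a valid, unprivileged TCP port.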

The training script has to use DistributedDataParallel to be able to synchronize your model across the GPUs of your job. The example multinode.py script from PyTorch is a very good starting point. The Trainer.__init__() and ddp_setup() methods should work out-of-the-box. You just need to modify the actual training bits in that script. If you need further help to adapt your trainings, please contact VUB-HPC Support.
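When the training script is started by torchrun, each process receives its position in the job through the environment variables RANK, LOCAL_RANK and WORLD_SIZE. The sketch below shows one possible way to read them in a training script. The DistEnv class, the function name and the fallback values for a run without torchrun are assumptions of this example and are not taken from multinode.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global ID of the process across all nodes
    local_rank: int  # ID of the process within its node
    world_size: int  # total number of processes in the job


def read_torchrun_env(env=None):
    """Read the process layout exported by torchrun.

    Assumption of this sketch: fall back to a single-process layout
    when the variables are absent (script started without torchrun).
    """
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

The local rank is typically the value used to select the GPU of the process, while the global rank and world size describe the process group as a whole.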

DDP without torchrun

It is also possible to use DistributedDataParallel without torchrun by manually setting everything up in your training script. In this case, the job script is much simpler:

Example job script for multi-node PyTorch without torchrun#
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --partition=ampere_gpu

module load PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1

srun python your_training_script.py

However, initializing the process group in your training script needs more work and you will have to update the Trainer.__init__() and ddp_setup() methods in the previous example script multinode.py as follows:

Modifications of multinode.py script to work without torchrun#
# <header of multinode.py>

MPI_RANK = int(os.environ['SLURM_PROCID'])  # ID of the current process (or rank)
MPI_SIZE = int(os.environ['SLURM_NTASKS'])  # Total number of processes (tasks in Slurm)

def ddp_setup():
    import subprocess

    node_list = os.environ['SLURM_NODELIST']  # List of allocated nodes
    master_host = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
    os.environ['MASTER_ADDR'] = master_host
    master_port = 12300 + (int(os.environ['SLURM_JOB_ID']) % 100)
    os.environ['MASTER_PORT'] = str(master_port)

    local_rank = MPI_RANK % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    torch.distributed.init_process_group(backend='nccl', world_size=MPI_SIZE, rank=MPI_RANK)

class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_data: DataLoader,
        optimizer: torch.optim.Optimizer,
        save_every: int,
        snapshot_path: str,
    ) -> None:
        self.global_rank = int(MPI_RANK)
        self.local_rank = self.global_rank % torch.cuda.device_count()
        self.gpu_device = torch.device(self.local_rank)
        self.model = model.to(self.gpu_device)

        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.epochs_run = 0
        self.snapshot_path = snapshot_path
        if os.path.exists(snapshot_path):
            print("Loading snapshot")
            self._load_snapshot(snapshot_path)

        self.model = DDP(self.model, device_ids=[self.local_rank])

# <the rest of the training script follows>
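The mapping from global rank to GPU used in the script above can be worked out by hand for the example job with 2 nodes and 2 GPUs per node. The sketch below is pure Python and passes the number of GPUs per node as an argument instead of calling torch.cuda.device_count(). It assumes that Slurm places consecutive ranks on the same node and that every task sees all GPUs of its node. The function name rank_layout is made up for this illustration.

```python
def rank_layout(world_size, gpus_per_node):
    """List (global rank, node index, local rank) for every task of the job."""
    layout = []
    for rank in range(world_size):
        # The remainder is the same value as `rank % gpus_per_node`
        node_index, local_rank = divmod(rank, gpus_per_node)
        layout.append((rank, node_index, local_rank))
    return layout


for rank, node, gpu in rank_layout(world_size=4, gpus_per_node=2):
    print(f"rank {rank}: node {node}, GPU {gpu}")
```

Under these assumptions, ranks 0 and 1 use GPUs 0 and 1 of the first node and ranks 2 and 3 use GPUs 0 and 1 of the second node, so each GPU is used by exactly one task.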

See also

The tutorial on distributed training from PyTorch Geometric shows a similar approach based on pure PyTorch, without any third-party modules or torchrun, and also provides example training scripts.

3.16. R#

Depending on your needs there are different methods to use R in Hydra:

  • Interactive sessions should be run on the compute nodes:

    1. Start an interactive job in a compute node, adjusting the number of cores with --cpus-per-task as appropriate (by default R only uses 1 core):

      srun --cpus-per-task=1 --pty bash -l
      
    2. Load your R module of choice (preferably a recent version):

      module load R/4.3.2-gfbf-2023a
      
    3. (Recommended) Load extra R packages, for instance those in the large bundle of packages from CRAN:

      module load R-bundle-CRAN/2023.12-foss-2023a
      
    4. Start the interactive R shell:

      R
      
  • Scripts written in R can be executed with the command Rscript. A minimal job script for R only requires loading the R module and executing your scripts with Rscript:

    #!/bin/bash
    #SBATCH --time=1:00:00
    #SBATCH --cpus-per-task=1

    module load R/4.3.2-gfbf-2023a
    module load R-bundle-CRAN/2023.12-foss-2023a

    Rscript <path-to-script.R>
    

Tip

The quality of the graphics generated by R can be improved by changing the graphical backend to Cairo. Add the following lines to the file ~/.Rprofile to make these changes permanent for your user (create the file ~/.Rprofile if it does not exist):

# Use cairo backend for graphics device
setHook(packageEvent("grDevices", "onLoad"),
    function(...) grDevices::X11.options(type='cairo'))

# Use cairo backend for bitmaps
options(bitmapType='cairo')

3.17. SRA-Toolkit#

The NCBI-VDB client in SRA-Toolkit has to be configured upon first use. The configuration covers data download and storage settings in your account in Hydra. The main setting is defining the location of your user public repository. We recommend using a folder in $VSC_DATA or $VSC_SCRATCH, otherwise you might quickly fill up your home directory.

Set your user public repository in VSC_DATA#
mkdir $VSC_DATA/ncbi
vdb-config --set /repository/user/main/public/root=$VSC_DATA/ncbi

Starting with v2.10.0, it is also possible to always store the data in the current working directory (whatever it is at the moment of execution). This option might be useful if you do not need to maintain a permanent data folder and just want to download the required data for your job on-the-fly. However, make sure to delete the data directories once your job completes, otherwise your scratch might fill up after multiple job runs.

Always use current working directory to download data#
vdb-config --prefetch-to-cwd

Run the command vdb-config -i to configure the complete set of options interactively.

3.18. Stata#

First check which Stata versions are available:

module spider Stata

Next load a suitable version:

module load Stata/16-legacy

Tip

Use the most recent version for new projects.

Run Stata in console mode in the terminal for quick tests:

stata

Stata do-files should be submitted to the queue in a job script. In the following example, we run the Stata program teststata.do:

#!/bin/bash
#SBATCH --time=1:0:0
#SBATCH --cpus-per-task=1

module load Stata/16-legacy

stata-se -b do teststata

Upon execution, Stata will by default write its output to the log file teststata.log.

Note

The recommended version of Stata in batch mode is stata-se, because it can handle larger datasets.