3. Specific Use Cases
3.1. AlphaFold
AlphaFold runs in three main steps:
1. multiple sequence alignment (MSA) using the AlphaFold databases (very I/O intensive)
2. inference step: runs the pretrained neural network on GPU
3. a short OpenMM minimization of the predicted structure
AlphaFold is installed for the Nvidia Ampere GPUs in Hydra. Submit your jobs
with the options --gpus-per-node=1 --partition=ampere_gpu to specifically
request one of those GPUs.
#!/bin/bash
#SBATCH --partition=ampere_gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=16

export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-pipe
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-mps-log
nvidia-cuda-mps-control -d

module load AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0

run_alphafold.py \
    --model_preset=monomer_casp14 \
    --fasta_paths=<protein_sequence>.fasta \
    --max_template_date=<YYYY-MM-DD> \
    --output_dir=$SLURM_SUBMIT_DIR
Tip
For a protein already available in the databases used by AlphaFold, set
max_template_date to a date prior to its publication date to avoid using
its experimental structure as a template.
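As a small illustration of the tip above, the cutoff date can be derived from a known release date with GNU date. This is only a sketch: the release date below is a made-up placeholder, and the -d option is specific to the GNU coreutils version of date.

```shell
#!/bin/bash
# Hypothetical release date of the experimental structure (placeholder value).
release_date="2021-07-15"

# GNU date: take the day before the release date, in YYYY-MM-DD format.
max_template_date=$(date -u -d "${release_date} 1 day ago" +%F)

# Option to pass to run_alphafold.py
echo "--max_template_date=${max_template_date}"
```

Any earlier date works as well; the day before the release is just the latest date that still excludes the structure.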
You only need the sequence file (FASTA format) of the protein for your job. We
provide the full datasets for AlphaFold in a central location accessible to all
users. The path to those datasets is /databases/bio/alphafold-<version>, where
<version> is any of the installed versions of AlphaFold. The software modules
of AlphaFold are configured to use the datasets in this shared central location
by default.
3.1.1. Using local SSD storage for large datasets
The datasets in /databases can be directly used in your jobs. However, the
performance of the shared storage can still be a bottleneck in your jobs as the
MSA step of AlphaFold is very intensive in random read patterns. We have a
caching system in place that greatly improves access times, but it is only
beneficial to series of runs (in a single job or multiple jobs) using common
DBs and executed in the same node.
Hydra has several ampere_gpu nodes specially suited to run AlphaFold. These
nodes have a very fast local scratch storage made of SSD drives with a total
capacity of 5.9 TB. Therefore, the performance of the MSA step can be greatly
improved by copying the DBs to the local $TMPDIR of the compute node running
your job. Follow these steps to leverage this fast SSD storage in your
AlphaFold jobs:
1. Submit your jobs specifically to nodes with a big SSD storage using the sbatch option --constraint="big_local_ssd"
2. In your job, first copy the DBs to your job’s temporary folder $TMPDIR. This folder is located in the big SSD storage. We highly recommend using fpsync for this task, which is a parallel rsync tool
    fpsync -n 8 $ALPHAFOLD_DATA_DIR $TMPDIR/alphafold-data-dir
3. Once the selected DBs are in the temporary storage, you can change the location of AlphaFold’s data dir by updating the $ALPHAFOLD_DATA_DIR environment variable after having loaded the AlphaFold module
    export ALPHAFOLD_DATA_DIR="$TMPDIR/alphafold-data-dir"
Example AlphaFold job script with data dir in big local SSD storage
#!/bin/bash
#SBATCH --partition=ampere_gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=16
#SBATCH --constraint="big_local_ssd"
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-pipe
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-mps-log
nvidia-cuda-mps-control -d
module load AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0
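# SLURM_CPUS_PER_TASK is only set with --cpus-per-task; fall back to the per-GPU value
fpsync -n ${SLURM_CPUS_PER_TASK:-${SLURM_CPUS_PER_GPU:-8}} $ALPHAFOLD_DATA_DIR $TMPDIR/alphafold-data-dir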
export ALPHAFOLD_DATA_DIR="$TMPDIR/alphafold-data-dir"
run_alphafold.py \
--model_preset=monomer_casp14 \
--fasta_paths=<protein_sequence>.fasta \
--max_template_date=<YYYY-MM-DD> \
--output_dir=$SLURM_SUBMIT_DIR
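Before copying the DBs, it can be worth checking that they fit in the free space of the local SSD. The following is a minimal sketch of such a check and not part of the AlphaFold module: the function name fits_in_dir is made up for this illustration, and it compares the size reported by du with the free space reported by df.

```shell
#!/bin/bash
# Return success when the directory tree SRC fits in the free space of DST.
fits_in_dir() {
    local src="$1" dst="$2"
    local needed_kb free_kb
    needed_kb=$(du -sk "$src" | awk '{print $1}')      # size of SRC in KiB
    free_kb=$(df -Pk "$dst" | awk 'NR==2 {print $4}')  # free space of DST in KiB
    [ "$needed_kb" -le "$free_kb" ]
}

# Only run the check when both variables are set, i.e. inside a job and
# after loading the AlphaFold module.
if [ -n "${ALPHAFOLD_DATA_DIR:-}" ] && [ -n "${TMPDIR:-}" ]; then
    if fits_in_dir "$ALPHAFOLD_DATA_DIR" "$TMPDIR"; then
        echo "OK: $ALPHAFOLD_DATA_DIR fits in $TMPDIR"
    else
        echo "WARNING: not enough free space in $TMPDIR" >&2
    fi
fi
```

Note that du has to traverse the whole dataset, so the check itself takes some time on large DBs.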
3.2. Ansys Fluent
Note
VUB-HPC does not provide licenses for Ansys Fluent. Research departments can bring their own license to use this software in our HPC clusters.
Ansys Fluent ships its own MPI implementations and does not need any additional
software module beyond loading a FLUENT/XXXX module to work in the HPC.
Benchmarks in our Tier-2 cluster Hydra showed that the best performance in
Fluent is achieved with its default MPI implementation ibmmpi. However, results
may vary depending on the characteristics of your simulation. Hence, it is
always recommended to run your own benchmarks with your own simulations before
executing the production runs in the cluster.
The following is an example job script to run Fluent simulations in parallel with MPI using the nodes in Hydra with fast InfiniBand network:
#!/bin/bash
#SBATCH --job-name=FluentRun
#SBATCH --time=72:00:00
#SBATCH --ntasks=64
#SBATCH --partition=skylake_mpi

# generate nodelist file
nodelist=$(scontrol show hostname $SLURM_NODELIST)
printf "%s\n" "${nodelist[@]}" > nodefile

module load FLUENT/2023R1

fluent 2ddp -g -t$SLURM_NTASKS -cnf=nodefile -pinfiniband -platform=intel -cflush < job.input > job.output

rm nodefile
3.3. CESM/CIME
The dependencies required to run CESM in Hydra are provided by the module
CESM-deps. This module also contains the XML configuration files for CESM with
the specification of machines, compiler and batch system of Hydra. Once
CESM-deps is loaded, the configuration files can be found in
${EBROOTCESMMINDEPS}/machines.
The following steps show an example setup of a CESM/CIME case:
1. Load the module CESM-deps
    module load CESM-deps/2-foss-2022a
2. All data files for CESM have to be placed in $VSC_SCRATCH/cesm (DIN_LOC_ROOT)
    mkdir $VSC_SCRATCH/cesm
   Users that want to store their data elsewhere (e.g. in a Virtual Organization) can link that location to the cesm folder in their $VSC_SCRATCH
    ln -sf $VSC_SCRATCH_VO_USER/cesm $VSC_SCRATCH/cesm
3. Create the following folder structure for your CESM cases in $VSC_SCRATCH
    mkdir $VSC_SCRATCH/cesm/cases
    mkdir $VSC_SCRATCH/cesm/output
    mkdir $VSC_SCRATCH/cesm/sources
4. An extensive input dataset for CESM is already available to all users in Hydra at /databases/climate/cesm/inputdata. Link that folder to your cesm directory
    ln -sf /databases/climate/cesm/inputdata $VSC_SCRATCH/cesm/inputdata
5. Download the source code of CESM/CIME into $VSC_SCRATCH/cesm/sources:
    cd $VSC_SCRATCH/cesm/sources
    git clone -b release-cesm2.2.2 https://github.com/ESCOMP/cesm.git cesm-2.2.2
    cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2
    ./manage_externals/checkout_externals
6. Add the configuration settings for Hydra and Breniac to your CESM/CIME source code
    cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2
    update-cesm-machines cime/config/cesm/machines/ $EBROOTCESMMINDEPS/machines/
7. (Optional) Add support for iRODS
    $ cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2
    $ git -C cime/ describe --tags
    cime5.8.32
    $ cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2
    $ git apply $EBROOTCESMMINDEPS/irods/cime-5.8.32/*.patch
8. The creation of a case follows the usual procedure for CESM. Create your case inside $VSC_SCRATCH/cesm/cases
    cd $VSC_SCRATCH/cesm/sources/cesm-2.2.2/cime/scripts
    ./create_newcase --case $VSC_SCRATCH/cesm/cases/name_of_case --res f19_g17 --compset I2000Clm50BgcCrop --compiler gnu
9. Your CESM case can now be set up, built and launched. You can do so manually from the login node using the standard procedure with the case.setup, case.build and case.submit scripts in the case folder.
   We also provide a job script called case.slurm to automatically perform all these steps in one go in the compute nodes. This approach minimizes wait times in the queue and ensures that the nodes building and running the case are compatible. You can copy the template in $EBROOTCESMMINDEPS/scripts/case.slurm to your case and modify it as needed (adding xmlchange or any other commands). Once the script is adapted to your needs, submit it to the queue with sbatch as any other job:
    cd $VSC_SCRATCH/cesm/cases/name_of_case
    cp $EBROOTCESMMINDEPS/scripts/case.slurm $VSC_SCRATCH/cesm/cases/name_of_case/
    $EDITOR case.slurm
    sbatch --ntasks=40 --time=24:00:00 case.slurm
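The folder layout created in the steps above can also be set up in a way that is safe to repeat. This is a minimal sketch under one assumption: when $VSC_SCRATCH is not set (i.e. outside of Hydra), it falls back to a throw-away directory so that the commands can be tried anywhere.

```shell
#!/bin/bash
# Assumption: fall back to a temporary directory if $VSC_SCRATCH is unset.
cesm_root="${VSC_SCRATCH:-$(mktemp -d)}/cesm"

# mkdir -p creates missing parent folders and does not fail on existing ones,
# so this can be executed again without errors.
mkdir -p "$cesm_root"/{cases,output,sources}

# Report any folder that is still missing.
for sub in cases output sources; do
    [ -d "$cesm_root/$sub" ] || echo "missing: $cesm_root/$sub" >&2
done
```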
The module CESM-tools provides a set of tools commonly used to analyse and
visualize CESM data. Nonetheless, CESM-tools cannot be loaded at the same time
as CESM-deps because their packages have incompatible dependencies. Once you
obtain the results of your case, unload any modules with module purge and load
CESM-tools/2-foss-2019a to post-process the data of your case.
3.4. CP2K
CP2K usually runs with best performance by distributing the workload of the
simulation across MPI tasks with 1 CPU core per task, following the standard
approach of parallel MPI jobs. In this case, it is important to disable
multi-threading by setting the environment variable OMP_NUM_THREADS=1.
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --ntasks=8

module load CP2K/9.1-foss-2022a
export OMP_NUM_THREADS=1

srun cp2k.popt -i example.inp -o example.out
However, it is not uncommon to hit memory leaks in CP2K once the simulation
size scales up (see issues cp2k#1830 or cp2k#2657). Depending on the version of
CP2K and the underlying MPI implementation, the aforementioned approach with a
pure MPI job might not work. You can identify such an issue if you see
segmentation fault errors in the output of your job containing
corrupted size vs. prev_size or corrupted double-linked list in the error
message. In such a case, the alternative approach is to use a hybrid job,
combining MPI tasks and threads. In the following example a simulation is run
in 80 CPU cores, using 40 MPI tasks with 2 cores per task:
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=2

module load CP2K/9.1-foss-2022a

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

srun cp2k.psmp -i example.inp -o example.out
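To find out whether a finished pure MPI job hit the memory errors described above, its output can be searched for the two messages. The following is a minimal sketch, assuming the job output is in the default slurm-<jobID>.out files written by sbatch; the function name has_heap_corruption is made up for this illustration.

```shell
#!/bin/bash
# Return success when FILE contains one of the two heap corruption messages.
has_heap_corruption() {
    grep -qE 'corrupted size vs\. prev_size|corrupted double-linked list' "$1"
}

# Scan the Slurm output files in the current directory.
for f in slurm-*.out; do
    [ -e "$f" ] || continue   # the pattern did not match any file
    if has_heap_corruption "$f"; then
        echo "$f: heap corruption detected, consider a hybrid MPI/OpenMP job"
    fi
done
```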
Increasing the number of CPU cores per MPI task is detrimental to the performance of CP2K though. Hence, the number of cores per task should be kept as low as possible, using a pure MPI job when possible. In our tests, the CPU time of each MD step increased by 50% by going from 2 to 20 cores per task.
Tasks | Cores per task | CPU time of MD step (s)
40    | 2              | 3.9
16    | 5              | 4.3
8     | 10             | 4.8
4     | 20             | 5.8
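All rows of the table use the same total of 80 CPU cores. As a small sketch, the number of MPI tasks for a given number of cores per task can be computed as follows; the function name mpi_tasks is hypothetical and only serves this illustration.

```shell
#!/bin/bash
# Print the number of MPI tasks for TOTAL cores and PER_TASK cores per task.
# Fails when the cores cannot be divided evenly between the tasks.
mpi_tasks() {
    local total="$1" per_task="$2"
    [ "$per_task" -gt 0 ] || return 1
    (( total % per_task == 0 )) || return 1
    echo $(( total / per_task ))
}

total_cores=80
for cores_per_task in 2 5 10 20; do
    ntasks=$(mpi_tasks "$total_cores" "$cores_per_task") || continue
    echo "#SBATCH --ntasks=$ntasks --cpus-per-task=$cores_per_task"
done
```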
3.5. GAMESS-US
GAMESS-US has its own command tool rungms to launch its simulations in
parallel. Using mpirun or srun is not needed.
#!/bin/bash
#SBATCH --time=6:00:00
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=2

module load GAMESS-US/20220930-R2-gompi-2022a

rungms <your_project>.inp $EBVERSIONGAMESSMINUS $SLURM_NTASKS ${SLURM_NTASKS_PER_NODE:-1} > <your_project>.out
This example job will request 2 nodes with 4 CPU cores per node. If you need
more cores, increase the number of nodes (preferably) and/or the number of
tasks per node. The job script will automatically pass those values to rungms
with the $SLURM_ variables. Replace <your_project> with the name of your
project files.
GAMESS-US is configured to use the storage in your $VSC_SCRATCH. All the files
generated by GAMESS-US will be located in specific folders for each of your
jobs.
3.6. GAP
The GAP shell has a strong focus on being used interactively, whereas on Hydra the preferred way to run calculations is by submitting job scripts. Nonetheless, it is possible to use the interactive shell of GAP in our compute nodes with the following steps:
1. Request an interactive job session and wait for it to be allocated:
    $ srun --cpus-per-task=4 --time=3:0:0 --pty bash -l
    srun: job <jobID> queued and waiting for resources
    srun: job <jobID> has been allocated resources
    vsc10xxx@node361 ~ $
2. Load the module of GAP and start its shell as usual:
    vsc10xxx@node361 ~ $ module load gap/4.11.0-foss-2019a
    vsc10xxx@node361 ~ $ gap
    *********   GAP 4.11.0 of 29-Feb-2020
    *  GAP  *   https://www.gap-system.org
    *********   Architecture: x86_64-pc-linux-gnu-default64-kv7
    [...]
    gap>
Submitting a job script using GAP is also possible and requires preparing two scripts. One is the usual job script to be submitted to the queue and the second one is the script with the commands for GAP.
The job script is a standard job script requesting the resources needed by your calculation, loading the required modules and executing the script with the code for GAP.
#!/bin/bash
#SBATCH --time=60:0
#SBATCH --cpus-per-task=4

module load gap/4.11.0-foss-2019a-modisomTob

./gap-script.sh
The script gap-script.sh is a shell script that executes GAP and passes your code to it. It is necessary to execute GAP with the -A option and only load the required GAP packages at the beginning of your script to avoid issues.
#!/bin/bash
gap -A -r -b -q << EOI
LoadPackage( "Example" );
2+2;
EOI
Important
Remember to make gap-script.sh executable with the command
chmod +x gap-script.sh
3.7. Gaussian
The available modules for Gaussian can be listed with the command:
module --show-hidden spider Gaussian
We recommend using the module Gaussian/G16.A.03-intel-2017b for general use
because its performance has been thoroughly optimized for Hydra. More recent
modules, such as Gaussian/G16.B.01, should be used if you need any of their
new features.
Gaussian jobs can use significantly more memory than the value specified by
%mem in the input file or with g16 -m in the execution command. Therefore, it
is recommended to submit Gaussian jobs requesting a total memory that is at
least 20% larger than the memory value defined in the calculation.
Gaussian G16 should automatically manage the available resources and
parallelization. However, it is known to under-perform in some circumstances
and not use all cores allocated to your job. In Hydra, the command myresources
will report the actual use of resources of your jobs. If any of your Gaussian
calculations is not using all available cores, it is possible to force the
total number of cores used by Gaussian G16 with the option g16 -p or by adding
the Gaussian directive %nprocshared to the top of the input file.
The following job script is an example to be used for Gaussian calculations
running on 1 node with multiple cores. In this case we are running a g16
calculation with 80GB of memory (-m=80GB), but requesting a total of
20 * 5GB = 100GB of memory (25% more). Additionally, we are requesting 20 cores
for this job and automatically passing this setting to g16 with the option
-p=${SLURM_CPUS_PER_TASK:-1}, where ${SLURM_CPUS_PER_TASK} is an environment
variable that contains the number of cores allocated to your job.
#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=5GB

ml Gaussian/G16.A.03-intel-2017b

g16 -p=${SLURM_CPUS_PER_TASK:-1} -m=80GB < input_file.com > output_file.log
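The memory request of the example can be reproduced with a bit of shell arithmetic. This is only a sketch: the variable names are made up for the illustration and the 20% margin is the minimum recommended above. Rounding up to whole GB per core is what turns the 20% margin into the 25% of the example.

```shell
#!/bin/bash
g16_mem_gb=80   # memory given to g16 with -m (or %mem in the input file)
cores=20        # value of --cpus-per-task
margin_pct=20   # extra memory on top of the g16 value, in percent

# Integer arithmetic, rounding up at both steps.
total_gb=$(( (g16_mem_gb * (100 + margin_pct) + 99) / 100 ))
mem_per_cpu_gb=$(( (total_gb + cores - 1) / cores ))

echo "#SBATCH --cpus-per-task=${cores}"
echo "#SBATCH --mem-per-cpu=${mem_per_cpu_gb}GB"
```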
3.8. GaussView
GaussView is a graphical interface used with the computational chemistry program Gaussian. GaussView is installed in Hydra and can be used alongside Gaussian to enable all property visualizations.
1. Log in to Hydra enabling X11 forwarding. Linux and macOS users can do so by adding the option -Y to the ssh command used for login:
    ssh -Y <username>@login.hpc.vub.be
2. Load the modules of GaussView
   GaussView 6 with Gaussian/G16.A.03:
    module load GaussView/6.0.16
   GaussView 6 with Gaussian/G16.B.01:
    module load Gaussian/G16.B.01
    module load GaussView/6.0.16
3. Launch GaussView:
    gview.sh
Keep in mind that using a graphical interface in Hydra is currently rather slow. Thus, for regular visualization tasks, the HPC team recommends installing GaussView on your personal computer. Binary packages of GaussView are available for Linux, Mac, and Windows and are provided upon request to VUB-HPC Support.
3.9. GROMACS
3.9.1. Threading Models
GROMACS supports two threading models, which can be used together:
- OpenMP threads
- thread-MPI threads: MPI-based threading model implemented as part of GROMACS, incompatible with process-based MPI models such as OpenMPI
There are two variants of the GROMACS executable:
- gmx: recommended for all single-node jobs, supports both OpenMP threads and thread-MPI threads
- gmx_mpi: for multi-node jobs, must be used with srun, only supports OpenMP threads
The number of threads must always be specified, as GROMACS sets it incorrectly on Hydra:
- gmx: use option -nt to let GROMACS determine optimal numbers of OpenMP threads and thread-MPI threads
- gmx_mpi: use option -ntomp (not -ntmpi or -nt), and set number of threads equal to 1
Running on 1 or more GPUs, by default GROMACS will:
- detect the number of available GPUs, create 1 thread-MPI thread for each GPU, and evenly divide the available CPU cores between the GPUs using OpenMP threads. Therefore, --cpus-per-task should be a multiple of the number of GPUs. Always check in the log file that the correct number of GPUs is indeed detected.
- optimally partition the force field terms between the GPU(s) and the CPU cores, depending on the number of GPUs and CPU cores and their respective performances.
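The rule that --cpus-per-task should be a multiple of the number of GPUs can be checked with a few lines of shell. This is a minimal sketch; the function name cores_match_gpus and the example values are made up for the illustration.

```shell
#!/bin/bash
# Succeed when CORES can be divided evenly between GPUS.
cores_match_gpus() {
    local cores="$1" gpus="$2"
    [ "$gpus" -gt 0 ] && (( cores % gpus == 0 ))
}

# Example values: 8 CPU cores shared by 2 GPUs.
cores=8
gpus=2
if cores_match_gpus "$cores" "$gpus"; then
    echo "OK: $(( cores / gpus )) OpenMP threads per GPU"
else
    echo "WARNING: $cores cores cannot be divided evenly between $gpus GPUs" >&2
fi
```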
3.9.2. Job Scripts
To get good parallel performance, GROMACS must be launched differently depending on the requested resources (#nodes, #cores, and #GPUs). In the example job scripts given below, a molecular dynamics simulation is launched with run input file example.tpr:
single-node, multi-core
#!/bin/bash
#SBATCH --time=1:0:0
#SBATCH --cpus-per-task=4

module load GROMACS/2020.4-foss-2020a-Python-3.8.2

gmx mdrun -nt ${SLURM_CPUS_PER_TASK:-1} -s example.tpr
multi-node
#!/bin/bash
#SBATCH --time=1:0:0
#SBATCH --ntasks=8

module load GROMACS/2020.4-foss-2020a-Python-3.8.2

srun gmx_mpi mdrun -ntomp 1 -s example.tpr
single-GPU, single-node, multi-core
#!/bin/bash
#SBATCH --time=1:0:0
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=4

module load GROMACS/2019.3-fosscuda-2019a

gmx mdrun -nt ${SLURM_CPUS_PER_TASK:-1} -s example.tpr
multi-GPU, single-node, multi-core
#!/bin/bash
#SBATCH --time=1:0:0
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-task=8

module load GROMACS/2019.3-fosscuda-2019a

gmx mdrun -nt ${SLURM_CPUS_PER_TASK:-1} -s example.tpr
See also
The chapter Getting good performance from mdrun in the GROMACS manual for more information on running GROMACS efficiently.
3.10. Mathematica
First check which Mathematica versions are available:
module spider Mathematica
Next load a suitable version:
module load Mathematica/12.3.1
Tip
Use the most recent version for new projects
For light-weight interactive use, you can run Mathematica in console mode with WolframScript:
wolframscript
Mathematica scripts (Wolfram Language code files) are also executed with the
wolframscript command, but should be submitted to the queue in a job script.
In the following example, we run the Wolfram Language file testmath.wl:
#!/bin/bash
#SBATCH --time=1:0:0
#SBATCH --cpus-per-task=1

module load Mathematica/12.3.1

wolframscript -file testmath.wl
Tip
Mathematica code is not optimized for performance. However, it supports several levels of interfacing to C/C++. For example, you can speed up your compute intensive functions by compiling them with a C compiler from inside your Mathematica script.
3.11. MATLAB
MATLAB is available as a module. However, it is not recommended to run intensive MATLAB calculations on Hydra: its performance is not optimal and parallel execution is not fully supported.
First check which MATLAB versions are available:
module spider MATLAB
Next load a suitable version:
module load MATLAB/2021a
Tip
Use the most recent version for new projects
It is possible to run MATLAB in console mode for quick tests. For example, with a MATLAB script called testmatlab.m, type:
matlab -batch "run('testmatlab.m');"
MATLAB scripts should be submitted to the queue in a job script. Before
submitting, however, we highly recommend first compiling your script using the
MATLAB compiler mcc (this can be done on the login node):
mcc -m testmatlab.m
This will generate a testmatlab binary file, as well as an executable
run_testmatlab.sh shell script (and a few other files, which you can ignore).
The shell script makes sure the environment is set correctly before executing
the binary.
Now you can submit your MATLAB calculation as a batch job. Your job script should look like this:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1

module load MATLAB/2021a

./run_testmatlab.sh $EBROOTMATLAB &> testmatlab.out
The advantage of running a compiled MATLAB binary is that it does not require a license. We have only a limited number of MATLAB licenses that can be used at the same time, so in this way you can run your simulation even if all the licenses are in use.
See also
The mcc documentation for more information on using the MATLAB compiler.
3.12. matplotlib
The HPC environment is optimized for the execution of non-interactive
applications in job scripts. Therefore, matplotlib is configured with a non-GUI
backend (Agg) that can save the resulting plots in a variety of image file
formats. The generated image files can be copied to your own computer for
visualization or further editing.
If you need to work interactively with matplotlib and visualize its output
from within Hydra, you can do so with the following steps:
1. Log in to Hydra enabling X11 forwarding. Linux and macOS users can use the following command:
    ssh -Y username@login.hpc.vub.be
2. Enable the TkAgg backend at the very beginning of your Python script:
    import matplotlib
    matplotlib.use('TkAgg')
Note
The function matplotlib.use() must be called before importing matplotlib.pyplot. Changing the backend parameter in your matplotlibrc file will not have any effect as the system-wide configuration takes precedence over it.
3.13. OpenFold
OpenFold is a trainable PyTorch-based implementation of AlphaFold.
OpenFold is installed for the Nvidia Ampere GPUs in Hydra. Submit your jobs with
the options --gpus-per-node=1 --partition=ampere_gpu to specifically request
one of those GPUs. If more memory is required for your job, increase the number
of CPU cores with --cpus-per-gpu (up to 16 cores per GPU).
#!/bin/bash
#SBATCH --partition=ampere_gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=8

module load OpenFold/1.0.1-foss-2021a-CUDA-11.3.1

# AlphaFold databases location:
ALPHAFOLD_DATA_DIR=/databases/bio/alphafold-2.2.0

# OpenFold parameters location:
OPENFOLD_PARAMETERS_DIR=/databases/bio/openfold-1.0.1/openfold_params/

run_pretrained_openfold.py \
    fasta_dir \
    $ALPHAFOLD_DATA_DIR/pdb_mmcif/mmcif_files/ \
    --uniref90_database_path $ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
    --mgnify_database_path $ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path $ALPHAFOLD_DATA_DIR/pdb70/pdb70 \
    --uniclust30_database_path $ALPHAFOLD_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --output_dir ./OUTPUT \
    --bfd_database_path $ALPHAFOLD_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device "cuda:0" \
    --jackhmmer_binary_path $EBROOTHMMER/bin/jackhmmer \
    --hhblits_binary_path $EBROOTHHMINSUITE/bin/hhblits \
    --hhsearch_binary_path $EBROOTHHMINSUITE/bin/hhsearch \
    --kalign_binary_path $EBROOTKALIGN/bin/kalign \
    --cpus=16 \
    --config_preset "model_1_ptm" \
    --openfold_checkpoint_path $OPENFOLD_PARAMETERS_DIR/finetuning_ptm_2.pt
You only need a directory (fasta_dir in example above) that contains one or more sequence files (FASTA format) of the protein for your job.
We provide the full AlphaFold datasets in a shared storage accessible to all
users. The path to those datasets is /databases/bio/alphafold-<version>, where
<version> is any of the installed versions of AlphaFold. The path is defined in
the example job script by the variable $ALPHAFOLD_DATA_DIR.
The OpenFold parameters are found in
/databases/bio/openfold-<version>/openfold_params/, where <version> is any of
the installed versions of OpenFold. The path is defined in the example job
script by the variable $OPENFOLD_PARAMETERS_DIR.
As explained in the AlphaFold sub-section on Using local SSD storage for large datasets, performance of the MSA step can be greatly improved by copying the datasets to the local scratch:
#SBATCH --constraint="big_local_ssd"
module load OpenFold/1.0.1-foss-2021a-CUDA-11.3.1
fpsync -n 12 $ALPHAFOLD_DATA_DIR $TMPDIR/alphafold-data-dir
ALPHAFOLD_DATA_DIR="$TMPDIR/alphafold-data-dir"
3.14. ORCA
Using ORCA in parallel has the particularity of having to execute orca
including the full path of the executable. This can be easily achieved by
relying on the $EBROOTORCA environment variable, which always points to the
installation directory of ORCA. Additionally, ORCA handles MPI on its own, so it
is not necessary to explicitly use mpirun or srun with it.
#!/bin/bash
#SBATCH --time=3:00:00
#SBATCH --ntasks=8

module load ORCA/5.0.1-gompi-2021a

$EBROOTORCA/bin/orca example.inp
The previous example uses 8 parallel tasks (--ntasks=8). Remember to always set
the number of processors in ORCA (%PAL NPROCS 8 END) to the number of tasks in
your job.
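A mismatch between both values can be caught in the job script before launching ORCA. The following is a minimal sketch that only covers the %PAL NPROCS form shown above; the function name orca_nprocs is made up for this illustration.

```shell
#!/bin/bash
# Print the first NPROCS value found in an ORCA input file (case-insensitive).
orca_nprocs() {
    grep -ioE 'nprocs[[:space:]]+[0-9]+' "$1" | grep -oE '[0-9]+' | head -n 1
}

# Compare with the number of tasks of the job. The check is skipped outside
# of a Slurm job or when the input file example.inp does not exist.
if [ -n "${SLURM_NTASKS:-}" ] && [ -f example.inp ]; then
    if [ "$(orca_nprocs example.inp)" != "$SLURM_NTASKS" ]; then
        echo "NPROCS in example.inp does not match --ntasks=$SLURM_NTASKS" >&2
    fi
fi
```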
Warning
ORCA v5.0.0 has a bug that causes parallel jobs using exactly 1 task per node to fail. Use version 5.0.1 to solve this issue.
3.15. PyTorch
PyTorch internally uses threads to run in parallel, but it makes some assumptions to determine the number of threads that do not apply to Hydra. This usually results in jobs with too many threads, saturating the allocated cores and hindering their performance. For optimal performance, the total number of threads should be equal to the number of cores allocated to your job. PyTorch can be configured to follow this rule by adding the following lines near the beginning of your Python script:
import os
import torch
torch.set_num_threads(len(os.sched_getaffinity(0)))
torch.set_num_interop_threads(1)
Note
If you are using Python multiprocessing on top of PyTorch, then your job will generate N threads for each PyTorch process (P), resulting in a total of N × P threads. In this case, you should adapt your job making sure that the number of requested cores is equal to the total number of threads (N × P).
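As a rough sketch of the arithmetic in the note above, the number of threads per process (N) can be derived in the job script from the cores available to the job and the number of processes (P). The number of processes below is a made-up example value; the resulting N would then be passed to torch.set_num_threads() in each process.

```shell
#!/bin/bash
# nproc reports the processing units available to the current process.
cores=$(nproc)

# Hypothetical number of Python worker processes (P) started by your script.
processes=2

# Threads per process (N), so that N x P does not exceed the available cores.
threads_per_process=$(( cores / processes ))
(( threads_per_process >= 1 )) || threads_per_process=1

echo "cores=$cores processes=$processes threads_per_process=$threads_per_process"
```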
3.15.1. PyTorch with multiple GPUs
PyTorch can leverage multiple GPUs across multiple nodes thanks to the
DistributedDataParallel class, which provides data parallelism by synchronizing
gradients across each model replica on the GPUs. However, using
DistributedDataParallel needs specific adaptations not only in your job script
but also in your model training scripts.
The job should only request 1 task per GPU. The following is taken from the documentation of PyTorch Distributed:
To use DistributedDataParallel on a host with N GPUs, you should spawn up N processes, ensuring that each process exclusively works on a single GPU from 0 to N-1.
Define --gpus-per-node as needed, e.g. 2 GPUs per node is the standard in
Hydra. Set the number of tasks with --ntasks-per-node to guarantee a correct
distribution of tasks across the GPUs in the different nodes. Both options
should always be set to the same value to ensure that the job uses 1 task per
GPU.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --partition=ampere_gpu

module load PyTorch/1.12.0-foss-2022a-CUDA-11.7.0

srun -n $SLURM_NTASKS python pytorch_multigpu_script.py
Your PyTorch script has to be launched in your job with srun. This works in
the same way as any other MPI job. srun will initialize the MPI stack, set up
the network connection between the nodes and GPUs and start all the tasks on
each GPU.
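The rule of 1 task per GPU can be verified at the start of the job script, before calling srun. This is a minimal sketch under two assumptions: the function name one_task_per_gpu is made up for this illustration, and the GPUs are requested as a plain number (without a GPU type) so that both Slurm variables are directly comparable.

```shell
#!/bin/bash
# Succeed when the number of tasks per node equals the number of GPUs per node.
one_task_per_gpu() {
    local tasks_per_node="$1" gpus_per_node="$2"
    [ -n "$tasks_per_node" ] && [ "$tasks_per_node" = "$gpus_per_node" ]
}

# Slurm exports these variables when the corresponding options are set in the
# job script. The check is skipped outside of a job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    if ! one_task_per_gpu "${SLURM_NTASKS_PER_NODE:-}" "${SLURM_GPUS_PER_NODE:-}"; then
        echo "WARNING: --ntasks-per-node and --gpus-per-node differ" >&2
    fi
fi
```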
The training script has to use DistributedDataParallel to be able to
synchronize your model across the GPUs of your job. The following is a minimal
example that distributes/copies some model across the GPUs of your jobs and
initializes a DistributedDataParallel instance for the distributed training.
import os
import subprocess
import torch
import torchvision

mpi_rank = int(os.environ['SLURM_PROCID'])  # ID of the current process (or rank)
mpi_size = int(os.environ['SLURM_NTASKS'])  # Total number of processes (tasks in Slurm)
node_list = os.environ['SLURM_NODELIST']  # List of allocated nodes

master_host = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
os.environ['MASTER_ADDR'] = master_host
master_port = 12300 + (int(os.environ['SLURM_JOB_ID']) % 100)
os.environ['MASTER_PORT'] = str(master_port)

# Generate unique GPU IDs combining local GPU number with the corresponding MPI rank
local_gpus = torch.cuda.device_count()
gpu_device_local_id = mpi_rank % local_gpus
gpu_device_local = torch.device(gpu_device_local_id)
torch.cuda.set_device(gpu_device_local_id)

# Initialize process group using world size and rank from the job environment.
torch.distributed.init_process_group(backend='nccl', world_size=mpi_size, rank=mpi_rank)

# Copy the model into the GPUs
model = torchvision.models.resnet18(weights=None)
model = model.to(gpu_device_local)

# Start synchronization
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[gpu_device_local_id],
    output_device=gpu_device_local_id,
)

# The rest of your training script follows...
See also
PyTorch also provides the torchrun utility which can simplify the process of
launching distributed workloads. You can execute your training directly with
torchrun training_script.py and it can take care of setting up the distributed
model (i.e. configuring MPI ranks and world size) from more user-friendly
command line options.
3.16. R
Depending on your needs there are different methods to use R in Hydra:
- Interactive sessions should be performed in the compute nodes
  1. Start an interactive job in a compute node adjusting the number of cores --cpus-per-task as appropriate (by default R only uses 1 core):
      srun --cpus-per-task=1 --pty bash -l
  2. Load your R module of choice (preferably a recent version):
      module load R/4.3.2-gfbf-2023a
  3. (Recommended) Load extra R packages. For instance, those in the big bundle of packages from CRAN
      module load R-bundle-CRAN/2023.12-foss-2023a
  4. Start the interactive R shell:
      R
- Scripts written in R can be executed with the command Rscript. A minimal job script for R only requires loading the R module and executing your scripts with Rscript
    #!/bin/bash
    #SBATCH --time=1:00:00
    #SBATCH --cpus-per-task=1

    module load R/4.3.2-gfbf-2023a
    module load R-bundle-CRAN/2023.12-foss-2023a

    Rscript <path-to-script.R>
Tip
The quality of the graphics generated by R can be improved by changing
the graphical backend to Cairo. Add the following lines to the file
~/.Rprofile to make these changes permanent for your user (create the
file ~/.Rprofile if it does not exist)
# Use cairo backend for graphics device
setHook(packageEvent("grDevices", "onLoad"),
    function(...) grDevices::X11.options(type='cairo'))

# Use cairo backend for bitmaps
options(bitmapType='cairo')
3.17. SRA-Toolkit
The NCBI-VDB client in SRA-Toolkit has to be configured upon first use. The
configuration covers data download and storage settings in your account in
Hydra. The main setting is defining the location of your user public
repository. We recommend using a folder in $VSC_DATA or $VSC_SCRATCH,
otherwise you might quickly fill up your home directory.
mkdir $VSC_DATA/ncbi
vdb-config --set /repository/user/main/public/root=$VSC_DATA/ncbi
Starting with v2.10.0, it is also possible to always store the data in the current working directory (whatever it is at the moment of execution). This option might be useful if you do not need to maintain a permanent data folder and just want to download the required data for your job on-the-fly. However, make sure to delete the data directories once your job completes, otherwise your scratch might fill up after multiple job runs.
vdb-config --prefetch-to-cwd
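When downloading to the current working directory, the cleanup can be automated in the job script. This is a minimal sketch and not a feature of SRA-Toolkit: it works in a throw-away directory and removes it when the script ends. The directory name sra-job.XXXXXX is arbitrary and /tmp is only a fallback for trying it outside of Hydra.

```shell
#!/bin/bash
# Create a throw-away working directory in your scratch.
workdir=$(mktemp -d "${VSC_SCRATCH:-/tmp}/sra-job.XXXXXX")

# Remove the directory and everything downloaded into it when the script ends,
# also if one of the commands fails.
trap 'rm -rf "$workdir"' EXIT

cd "$workdir" || exit 1
# ... download and process your data here, and copy any results that you
# want to keep to a permanent location before the script ends ...
```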
The complete set of options can be configured with the command vdb-config -i
3.18. Stata
First check which Stata versions are available:
module spider Stata
Next load a suitable version:
module load Stata/16-legacy
Tip
Use the most recent version for new projects
Running Stata in console mode in the terminal for quick tests:
stata
Stata do-files should be submitted to the queue in a job script. In the following example, we run the Stata program teststata.do:
#!/bin/bash
#SBATCH --time=1:0:0
#SBATCH --cpus-per-task=1

module load Stata/16-legacy

stata-se -b do teststata
Upon execution, Stata will by default write its output to the log file teststata.log.
Note
The recommended version of Stata in batch mode is stata-se, because it can
handle larger datasets.