4. GPU Job Types

GPU jobs are no different from standard non-GPU jobs: they get a certain number of CPU cores and an amount of memory as described in the previous sections. GPUs are an extra resource on top of those, and their allocation is controlled by a separate family of options. The job scheduler, Slurm, automatically identifies jobs requesting GPUs and sends them to nodes with GPU accelerators.

Warning

Not all software modules can be used on GPUs. Only those modules with CUDA in their version string support offloading onto GPUs.
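As a small illustration of the rule above, the following sketch checks whether a module name carries CUDA in its version string. The function name `has_cuda` and the second (CPU-only) module name are hypothetical and only serve the example.

```shell
#!/bin/bash
# Return success (0) if the module name contains CUDA in its version string.
has_cuda() {
  case "$1" in
    *CUDA*) return 0 ;;
    *)      return 1 ;;
  esac
}

# The first name follows the pattern used in this page; the second is a
# hypothetical CPU-only build of the same software.
for mod in "CoolGPUSoftware/x.y.z-foss-2021a-CUDA-11.3.1" \
           "CoolGPUSoftware/x.y.z-foss-2021a"; do
  if has_cuda "$mod"; then
    echo "$mod: can offload to GPUs"
  else
    echo "$mod: CPU only"
  fi
done
```

On the cluster, a similar filter can be applied to the list of installed modules, for instance with `module avail 2>&1 | grep CUDA`.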

Slurm provides several options to request GPUs. You might find the following common ones in the Slurm documentation or other sources of information:

  • --gpus=X sets the total number of GPUs allocated to the job to X

  • --gpus-per-node=X allocates X GPUs for each node where the job runs

  • --gpus-per-task=X allocates X GPUs for each task requested by the job

  • --gpus-per-socket=X allocates X GPUs for each CPU socket used by the job

  • --gres=gpu:X is an older option that allocates X GPUs per node (equivalent to --gpus-per-node)

4.1. GPU generation

Jobs can request a specific GPU generation or model with the following options:

  • -p pascal_gpu for the Nvidia P100

  • -p ampere_gpu for the Nvidia A100

For instance, you might need a specific GPU type to reproduce previous results, or because your job needs more GPU memory than is available in older GPU models. The characteristics of our GPUs are listed in Hydra Hardware (VSC documentation). Keep in mind that more specific job requests will probably wait longer in the queue.
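A minimal sketch of a job that pins the GPU generation is shown below. It uses the long form `--partition` of the `-p` option, and the helper `describe_gpus` is a hypothetical name introduced for this example. It assumes that `nvidia-smi` is available on the GPU nodes and simply skips that step otherwise.

```shell
#!/bin/bash
#SBATCH --job-name=mygpujob
#SBATCH --time=00:10:00
#SBATCH --partition=ampere_gpu   # same as -p ampere_gpu (Nvidia A100)
#SBATCH --gpus=1

# Print the partition of the job and, if possible, the GPU model and memory.
describe_gpus() {
  echo "partition: ${SLURM_JOB_PARTITION:-unknown}"
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
  fi
}

describe_gpus
```

Checking the reported GPU model at the start of a job is a cheap way to confirm that the results were produced on the intended hardware.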

4.2. Memory settings of GPU jobs

The amount of system memory assigned to your job automatically scales with the number of CPU cores requested and follows the same rules as for non-GPU jobs.

Alternatively, you can use --mem-per-gpu=X to set the amount of system memory as a function of the number of GPUs allocated to your job. This setting is not related to the memory of the GPU cards themselves: it only affects the system memory (RAM) available to the job on the host.
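The following sketch illustrates how the request scales. The value of 32G is purely illustrative and not a site recommendation, and `host_mem_gb` is a hypothetical helper that just multiplies the number of GPUs by the memory per GPU.

```shell
#!/bin/bash
#SBATCH --gpus=2
#SBATCH --mem-per-gpu=32G   # illustrative value: system memory, not GPU memory

# host_mem_gb GPUS MEM_PER_GPU_GB
# Total system memory (in GB) requested through --mem-per-gpu.
host_mem_gb() {
  echo $(( $1 * $2 ))
}

echo "System memory requested: $(host_mem_gb 2 32) GB"
```

With 2 GPUs and 32 GB per GPU, the job asks for 64 GB of system memory in total, regardless of how much memory the GPU cards have.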

4.3. Single GPU jobs

Recommended: use --gpus in your single GPU jobs.

All GPU options in Slurm work well for single GPU jobs. We recommend requesting a single GPU with --gpus=1 for simplicity. The option --gpus needs no other considerations beyond the number of requested GPUs.

Basic multi-core, single-GPU Slurm batch script

    #!/bin/bash
    #SBATCH --job-name=mygpujob
    #SBATCH --time=04:00:00
    #SBATCH --gpus=1
    #SBATCH --cpus-per-task=16

    module load CoolGPUSoftware/x.y.z-foss-2021a-CUDA-11.3.1

    <cool-gpu-program>

Applications executed on GPUs still need some CPU power to work. By default, all jobs get only 1 task with 1 CPU core. If your software runs more than 1 process in parallel, or multiple independent tasks on the GPUs, use the option --ntasks to set the number of tasks and/or --cpus-per-task to set the number of CPU cores for each task.

Important

You cannot request more CPU cores per GPU than those attached to it. On nodes with 2 GPUs, that is half the cores of the node. Our hardware specifications list the number of cores available in the nodes of our clusters.
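The limit can be checked before submitting a job with a few lines of shell arithmetic. This is a sketch that assumes 16 CPU cores per GPU, the value given in this page for the nodes of the ampere_gpu partition; `check_cores` and `CORES_PER_GPU` are hypothetical names, and other partitions may need a different value.

```shell
#!/bin/bash
# Assumption: 16 CPU cores per GPU, as on the ampere_gpu nodes.
CORES_PER_GPU=16

# check_cores GPUS_PER_NODE TASKS_PER_NODE CPUS_PER_TASK
# Compare the cores requested on a node with those attached to its GPUs.
check_cores() {
  local gpus=$1 tasks=$2 cpus=$3
  local requested=$(( tasks * cpus ))
  local limit=$(( gpus * CORES_PER_GPU ))
  if (( requested > limit )); then
    echo "too many cores: $requested requested, $limit allowed"
    return 1
  fi
  echo "ok: $requested of $limit cores"
}

check_cores 1 1 16          # 1 GPU, 1 task with 16 cores: allowed
check_cores 1 1 24 || true  # 1 GPU, 1 task with 24 cores: over the limit
```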

4.4. Multi GPU jobs

Recommended: use --gpus-per-node combined with --ntasks-per-node in your multi-GPU jobs.

Jobs can request as many GPUs as are available in each GPU partition of the cluster (they are not limited to a single node). In this case, we recommend requesting the number of nodes with --nodes=N and setting how many GPUs the job will use on each node with --gpus-per-node=G. Hence, the total number of GPUs for your job will be N × G.

In the example below, the job requests 4 GPUs in total (2 GPUs on each of 2 nodes) and 2 tasks per node, each with 16 CPU cores. The hardware specifications show the distribution of GPUs and nodes in each partition.

Important

Not all software supports using multiple GPUs in different nodes. In case of doubt, check the documentation of your software or contact VUB-HPC Support.

Example Slurm batch script with 4 GPUs in 2 nodes

    #!/bin/bash
    #SBATCH --job-name=mygpujob
    #SBATCH --time=04:00:00
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=2
    #SBATCH --ntasks-per-node=2
    #SBATCH --cpus-per-task=16

    module load CoolGPUSoftware/x.y.z-foss-2021a-CUDA-11.3.1

    srun -n 1 --exact <cool-gpu-program> <input_1> &
    srun -n 1 --exact <cool-gpu-program> <input_2> &
    srun -n 1 --exact <cool-gpu-program> <input_3> &
    srun -n 1 --exact <cool-gpu-program> <input_4> &
    wait
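The totals requested by the batch script above follow directly from its options. The short sketch below redoes the arithmetic with shell variables, whose names are chosen for this example only.

```shell
#!/bin/bash
# Values taken from the #SBATCH lines of the example script.
nodes=2
gpus_per_node=2
ntasks_per_node=2
cpus_per_task=16

total_gpus=$(( nodes * gpus_per_node ))          # N x G
total_tasks=$(( nodes * ntasks_per_node ))
total_cores=$(( total_tasks * cpus_per_task ))

echo "GPUs: $total_gpus, tasks: $total_tasks, CPU cores: $total_cores"
```

The job therefore holds 4 GPUs, 4 tasks and 64 CPU cores, which is why the script can launch 4 independent `srun` steps at once.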

Avoid setting job tasks with either --ntasks or --ntasks-per-gpu, as those can result in a bad task distribution or job errors. On the other hand, the option --gpus works well for multi-GPU jobs as long as it is combined with --ntasks-per-node to set the tasks.

Beware that any task/process on a node with multiple GPUs allocated can use all local GPUs. It is therefore up to the software used in the job to properly distribute the work among the GPUs. Ideally, your application should use the CPU nearest to the GPU in use. If you need to manually bind a task to a specific GPU, you can use the job options --gpus-per-task and --ntasks instead.
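When the software cannot select a GPU by itself, one common workaround is a small wrapper that narrows the visible GPUs of each task. The sketch below is not a site-provided tool: `pick_gpu` is a hypothetical helper, and it assumes that Slurm exports `CUDA_VISIBLE_DEVICES` with the comma-separated list of GPUs on the node and `SLURM_LOCALID` with the node-local task number.

```shell
#!/bin/bash
# pick_gpu VISIBLE_LIST LOCAL_ID
# Select one entry of a comma-separated GPU list, cycling over the list
# when there are more tasks than GPUs on the node.
pick_gpu() {
  local IFS=','
  local -a gpus
  read -r -a gpus <<< "$1"
  local count=${#gpus[@]}
  (( count > 0 )) || return 1
  echo "${gpus[$(( $2 % count ))]}"
}

# Inside a wrapper script started by srun, the selection could be applied as:
#   export CUDA_VISIBLE_DEVICES=$(pick_gpu "$CUDA_VISIBLE_DEVICES" "$SLURM_LOCALID")
#   exec "$@"
pick_gpu "0,1" 0
pick_gpu "0,1" 1
```

This only restricts which GPU a task sees. It does not guarantee that the task runs on the CPU cores nearest to that GPU, so the job options mentioned above remain the safer route.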

4.5. Advanced: task distribution in GPUs

Slurm provides many options to configure the request of GPU resources for your jobs. We have seen that all those options can behave differently depending on which other options are used in the job. Usually, those differences do not manifest in single GPU jobs, but they can impact jobs executing multiple tasks on multiple GPUs.

The following tables show how various options distribute tasks among the GPUs allocated to the job. The resulting task distribution is classified into 4 main outcomes:

  • N – N: Correct task distribution. Tasks are evenly distributed among GPUs, and each task can access a single GPU.

  • 2N – 2N: Undefined task distribution. Tasks are correctly distributed among the CPUs bound to each GPU, but they can access all GPUs allocated to the job on that node. This outcome is not necessarily bad: it is up to the software run in the job to pick the correct GPU for each task/process.

  • I – J: Wrong task distribution. Each task is assigned to a single GPU, but the distribution does not follow the configuration set in the job. This outcome will hinder performance, as the distribution of tasks is not what was intended for the job.

  • error: Bad configuration. The job will not start due to errors or due to the wrong binding of CPU/GPU resources, as tasks would be placed on the wrong CPU socket for the allocated GPU.

Note

Our recommendations for Single GPU jobs and Multi GPU jobs are based on these results.
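To see which outcome a given combination of options produces in your own job, each task can report the GPUs it has access to. The following is a sketch under the assumption that Slurm sets the variables `SLURM_PROCID`, `SLURM_LOCALID`, `SLURMD_NODENAME` and `CUDA_VISIBLE_DEVICES` for each task; the function name `report_task_gpus` is hypothetical.

```shell
#!/bin/bash
# Print one line per task with its rank, node and visible GPUs.
# Fallback values are used when the script runs outside of a Slurm job.
report_task_gpus() {
  printf 'task=%s local=%s node=%s gpus=%s\n' \
    "${SLURM_PROCID:-?}" \
    "${SLURM_LOCALID:-?}" \
    "${SLURMD_NODENAME:-${HOSTNAME:-unknown}}" \
    "${CUDA_VISIBLE_DEVICES:-none}"
}

report_task_gpus
```

Saved as an executable script and launched with `srun` inside a job, it prints one line per task. Tasks that list a single GPU correspond to the N – N outcome, while tasks that list all GPUs of the node correspond to 2N – 2N.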

4.5.1. Option --gpus

Distribution of tasks across requested GPUs using the --gpus option of sbatch. Examples carried out on the nodes of the ampere_gpu partition with 2 GPUs per node and 16 CPU cores per GPU.

Distribution of tasks with --gpus and --ntasks

| Total tasks (--ntasks) | 1 GPU in 1 node, --gpus=1 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 1 node, --gpus=2 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 2 nodes, --gpus=2 --nodes=2 (GPU0 – GPU1) |
|---|---|---|---|
| 2 | 2 – 0 | 2 – 2 | 1 – 1 |
| 8 | 8 – 0 | 8 – 8 | 7 – 1 |
| 16 | 16 – 0 | 16 – 16 | 15 – 1 |
| 24 | disallowed | 24 – 24 | 16 – 8 |
| 32 | disallowed | 32 – 32 | 16 – 16 |

Distribution of tasks with --gpus and --ntasks-per-node

| Total tasks (--ntasks-per-node) | 1 GPU in 1 node, --gpus=1 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 1 node, --gpus=2 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 2 nodes, --gpus=2 --nodes=2 (GPU0 – GPU1) |
|---|---|---|---|
| 2 | 2 – 0 | 2 – 2 | 1 – 1 |
| 8 | 8 – 0 | 8 – 8 | 4 – 4 |
| 16 | 16 – 0 | 16 – 16 | 8 – 8 |
| 24 | disallowed | 24 – 24 | 12 – 12 |
| 32 | disallowed | 32 – 32 | 16 – 16 |

Distribution of tasks with --gpus and --ntasks-per-gpu

| Total tasks (--ntasks-per-gpu) | 1 GPU in 1 node, --gpus=1 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 1 node, --gpus=2 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 2 nodes, --gpus=2 --nodes=2 (GPU0 – GPU1) |
|---|---|---|---|
| 2 | 2 – 0 | error | 1 – 1 |
| 8 | 8 – 0 | error | 4 – 4 |
| 16 | 16 – 0 | error | 8 – 8 |
| 24 | disallowed | error | 12 – 12 |
| 32 | disallowed | error | 16 – 16 |

4.5.2. Option --gpus-per-node

Distribution of tasks across requested GPUs using the --gpus-per-node option of sbatch. Examples carried out on the nodes of the ampere_gpu partition with 2 GPUs per node and 16 CPU cores per GPU.

Distribution of tasks with --gpus-per-node and --ntasks

| Total tasks (--ntasks) | 1 GPU in 1 node, --gpus-per-node=1 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 1 node, --gpus-per-node=2 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 2 nodes, --gpus-per-node=1 --nodes=2 (GPU0 – GPU1) |
|---|---|---|---|
| 2 | 2 – 0 | 2 – 2 | 1 – 1 |
| 8 | 8 – 0 | 8 – 8 | 7 – 1 |
| 16 | 16 – 0 | 16 – 16 | 15 – 1 |
| 24 | disallowed | 24 – 24 | 16 – 8 |
| 32 | disallowed | 32 – 32 | 16 – 16 |

Distribution of tasks with --gpus-per-node and --ntasks-per-node

| Total tasks (--ntasks-per-node) | 1 GPU in 1 node, --gpus-per-node=1 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 1 node, --gpus-per-node=2 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 2 nodes, --gpus-per-node=1 --nodes=2 (GPU0 – GPU1) |
|---|---|---|---|
| 2 | 2 – 0 | 2 – 2 | 1 – 1 |
| 8 | 8 – 0 | 8 – 8 | 4 – 4 |
| 16 | 16 – 0 | 16 – 16 | 8 – 8 |
| 24 | disallowed | 24 – 24 | 12 – 12 |
| 32 | disallowed | 32 – 32 | 16 – 16 |

Distribution of tasks with --gpus-per-node and --ntasks-per-gpu

| Total tasks (--ntasks-per-gpu) | 1 GPU in 1 node, --gpus-per-node=1 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 1 node, --gpus-per-node=2 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 2 nodes, --gpus-per-node=1 --nodes=2 (GPU0 – GPU1) |
|---|---|---|---|
| 2 | 2 – 0 | error | 1 – 1 |
| 8 | 8 – 0 | error | 4 – 4 |
| 16 | 16 – 0 | error | 8 – 8 |
| 24 | disallowed | error | 12 – 12 |
| 32 | disallowed | error | 16 – 16 |

4.5.3. Option --gpus-per-task

Distribution of tasks across requested GPUs using the --gpus-per-task option of sbatch. Examples carried out on the nodes of the ampere_gpu partition with 2 GPUs per node and 16 CPU cores per GPU.

Distribution of tasks with --gpus-per-task and --ntasks

| Total tasks (--ntasks) | 1 GPU in 1 node, --gpus-per-task=1 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 1 node, --gpus-per-task=2 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 2 nodes, --gpus-per-task=1 --nodes=2 (GPU0 – GPU1) |
|---|---|---|---|
| 1 | 1 – 0 | | |
| 2 | | error | 1 – 1 |

Distribution of tasks with --gpus-per-task and --ntasks-per-node

| Total tasks (--ntasks-per-node) | 1 GPU in 1 node, --gpus-per-task=1 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 1 node, --gpus-per-task=2 --nodes=1 (GPU0 – GPU1) | 2 GPUs in 2 nodes, --gpus-per-task=1 --nodes=2 (GPU0 – GPU1) |
|---|---|---|---|
| 1 | 1 – 0 | | |
| 2 | | error | 1 – 1 |