2. Slurm clusters#

2.1. Hydra: main cluster#

The Hydra cluster has multiple job queues, one for each partition. Jobs are automatically assigned to the partitions that have the computational resources needed to fulfil the requirements of the job. For instance, jobs requesting a GPU will automatically queue on a partition with GPU nodes, and jobs requesting more than 1 node will automatically queue on a partition with the fastest network interconnect.
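For example, a job script can request a GPU with the generic --gpus option and it will be routed to a GPU partition without naming any partition explicitly. A minimal sketch (resource values and program name are illustrative):

Example job script requesting a GPU#
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --time=01:00:00        # walltime, well below the 120 hour limit of Hydra
#SBATCH --cpus-per-task=4
#SBATCH --gpus=1               # requesting a GPU routes the job to a GPU partition

srun ./my_gpu_program          # placeholder for your GPU executable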

Important

Jobs in Hydra can run for a maximum of 120 hours (5 days).

Jobs submitted from the login nodes will be sent to the job queues of Hydra by default.
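For instance, submitting a job script from a login node without any cluster option sends it to Hydra, and the maximum walltime can be requested explicitly with the --time option (the script name is illustrative):

Submitting a job to Hydra with the maximum walltime#
$ sbatch --time=120:00:00 my_job.sh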

2.2. Anansi: test/debug cluster#

Jobs for interactive use, testing, and debugging can be run on the Anansi cluster. As Hydra’s smaller sister cluster, Anansi is designed for short tasks and provides minimal queue times.

The main characteristics of Anansi are:

  • CPU cores can be shared by up to 4 jobs

  • GPU devices can be shared by up to 4 jobs

  • Maximum number of GPU devices per job is 1 (100% GPU fraction)

  • Maximum job walltime is 12 hours

In contrast to Hydra, where computational resources are allocated exclusively to single jobs, the resources of Anansi can be shared between multiple jobs. This applies to CPU cores, which are shared transparently, and to GPUs, which can be requested in fractions of a full device. Other resources, such as CPU memory, are not shared.
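Assuming the GPU fractions of Anansi are exposed through Slurm's gres/shard mechanism (the exact GRES name is site-specific), half of a GPU could be requested for an interactive session as in the following sketch:

Requesting a fraction of a GPU on Anansi (sketch)#
$ srun -M anansi --gres=shard:2 --cpus-per-task=4 --pty bash   # 2 of 4 shards = 50% of one GPU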

We also limit the total amount of resources you can use across all your jobs running in Anansi. Any single user can use up to 16 CPU cores and up to 100% of a single GPU (4 GPU shards) across all their jobs on Anansi. Jobs submitted once this limit is reached will wait in the queue with a Slurm job reason code of either QOSMaxCpuPerUserLimit or QOSMaxGRESPerUser. Those jobs will be held in the queue until your other jobs finish and their resources are freed.
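The reason code of a pending job can be checked with a standard squeue query, for instance:

Checking why your jobs in Anansi are pending#
$ squeue -M anansi -u $USER -O jobid,state,reason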

Single jobs requesting more than 16 CPU cores or more than 100% of a GPU (4 GPU shards) will be refused at submission with the message:

Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)

This approach is especially useful for lightweight (interactive) tasks, such as testing and debugging, that can involve frequent idle periods. Hence, even though Anansi is smaller than Hydra, users should be able to quickly find an available slot to run their short jobs.

Moreover, the maximum time a job can run in Anansi is limited to 12 hours, which further increases the availability of its resources.

Jobs can be submitted, managed and monitored in Anansi from the login nodes of Hydra. The target cluster of any Slurm command can be changed to Anansi by adding the option -M anansi to the command arguments or by adding it as an #SBATCH option in the header of your job script.

Command to check the job queue in Anansi#
$ mysqueue -M anansi
CLUSTER: anansi
     JOBID PARTITION   NAME          USER     STATE  TIME TIME_LIMIT NODES CPUS MIN_MEMORY NODELIST(REASON)
  50005540 ada_gpu     test.job  vsc10000   RUNNING  0:05      10:00     1    4         6G node510
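Likewise, a job script can target Anansi from its header with an #SBATCH directive; a minimal sketch (script contents are illustrative):

Job script targeting Anansi#
#!/bin/bash
#SBATCH -M anansi              # submit this job to the Anansi cluster
#SBATCH --time=02:00:00        # within the 12 hour walltime limit of Anansi
#SBATCH --cpus-per-task=4

./my_test_program              # placeholder for the task to test or debug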