Hydra Overview 2024#

../../../_images/overview-2024.jpg

In 2024 a lot of time was spent on the tender process to procure the next VSC Tier-1 supercomputer, which will be hosted by VUB. Nevertheless, our Tier-2 HPC cluster Hydra underwent several key changes. It’s time for some statistics on the usage of the infrastructure in 2024.

Highlights of 2024 for Hydra:

  • 2.6 centuries of single-core CPU compute time were used (see the conversion below)

  • 21.2 years of GPU compute time were used

  • 1,500,695 jobs were run; the large majority of them ran for less than 1 hour

  • About 24% of the jobs were responsible for 98% of the used CPU compute time and 96% of the used GPU compute time

  • There were 383 unique users active on Hydra in 2024 and 237 new VUB VSC accounts were created
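
To put the first two highlights in perspective, a rough conversion to more common units (ignoring leap days):

    2.6 centuries ≈ 2.6 × 100 × 365 × 24 ≈ 2.28 million core-hours
    21.2 years ≈ 21.2 × 365 × 24 ≈ 186,000 GPU-hours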

Users#

There were 383 unique users of Hydra in 2024 (i.e. users who submitted at least one job), compared to 349 in 2023:

  • 363 unique users of the CPU nodes (2023: 327)

  • 130 unique users of the GPU nodes (2023: 110)

Split of all users by employment type:

Employment Type                  Users   Relative
------------------------------   -----   --------
Guest professor                      5      1.3%
Professor                           10      2.6%
Administrative/technical staff      17      4.4%
Non-VUB                             23      6.0%
Student                             97     25.3%
Researcher                         231     60.3%

The distribution per month shows the busiest periods of the year:

../../../_images/active_users_per_month_2024.png

We saw a larger increase in VSC accounts for VUB in 2024: 237 new accounts were created (compared to 177 in 2023). This increase shows that the effort to recruit more users is starting to pay off.

../../../_images/vsc_accounts_per_year_2025.png

CPU Usage#

On average, the load on the cluster was 61%. The usage pattern matches that of 2023, i.e. regular large fluctuations in use that follow a weekly cycle.

Tip

Usage on weekends is systematically low, so if you are in a rush to get your jobs started quickly, that is the best time to submit them.
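
For example, assuming Slurm (the scheduler used on Hydra), you can submit a job right away but only make it eligible to start from a given moment with the --begin option of sbatch; the timestamp and script name below are placeholders:

    # Submit job.sh now, but only let it become eligible to start
    # from Saturday 06:00 onwards (adjust the date to the next weekend)
    sbatch --begin=2024-06-08T06:00:00 job.sh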

The month of September stands out: usage was low, yet the number of active users was high.

../../../_images/xdmod_CPU_Usage_Hydra_2024.png

About 80% of the used CPU compute time comes from just 10% of the users.
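
Such a concentration figure is easy to compute from per-user usage totals. A minimal sketch, assuming a hypothetical file usage.txt with one "<user> <core_hours>" line per user:

    # Sort users by usage (descending) and sum the share of the top 10%
    sort -k2,2nr usage.txt | awk '
        { v[NR] = $2; total += $2 }
        END {
            top = int(NR / 10); if (top < 1) top = 1
            for (i = 1; i <= top; i++) s += v[i]
            printf "top 10%% of users account for %.1f%% of usage\n", 100 * s / total
        }'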

Split of used CPU compute time per VSC institute:

Institute    Usage
----------   -----
VUB          99.1%
UAntwerpen    0.7%
KULeuven      0.2%
UGent         0.1%

For the remainder of the analysis we only consider jobs that ran for at least one hour; shorter jobs are assumed to be tests or failed jobs. These jobs represent 98% of the used compute time.
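
As an illustration of such a filter, assuming the Slurm accounting tools, a command along these lines could list the jobs of 2024 that ran for at least one hour (a sketch only; the actual pipeline behind these statistics may differ):

    # One line per job allocation; Elapsed is formatted as [DD-]HH:MM:SS
    # (-a shows all users and requires operator privileges; drop it to see your own jobs)
    sacct -a -X -n -P -S 2024-01-01 -E 2025-01-01 \
          --format=JobID,AllocCPUS,NNodes,Elapsed |
    awk -F'|' '{
        e = $4
        if (e ~ /-/) { print; next }   # contains a day count, so >= 1 day
        split(e, t, ":")
        if (t[1] + 0 >= 1) print       # hours field >= 1
    }'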

Profile of a typical CPU job split in percentiles of the number of jobs:

Percentile   Cores   Nodes   Walltime
----------   -----   -----   ----------------
50%              1       1   0 days 04:12:50
75%              1       1   0 days 09:27:37
80%              2       1   0 days 10:13:38
90%              8       1   0 days 14:06:10
95%             16       1   1 day 08:07:08
99%             40       2   3 days 13:17:12
99.9%          128      10   5 days 00:00:22

Note

A typical CPU job (90% of all jobs) runs on a single node, uses 8 cores or fewer, and ends within 14 hours.
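
As a hedged illustration, a Slurm batch script requesting the resources of such a typical job could look as follows; the job name, module and program are placeholders:

    #!/bin/bash
    #SBATCH --job-name=typical-cpu-job
    #SBATCH --ntasks=1            # single task on a single node
    #SBATCH --cpus-per-task=8     # typical job: 8 cores or fewer
    #SBATCH --time=14:00:00       # typical job: done within 14 hours

    module load SomeSoftware/1.0  # placeholder: load your own modules
    srun ./my_program             # placeholder: your actual workload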

If we look at the number of used cores as a function of the used CPU compute time of those jobs, we see that this matches the previous percentiles based on the number of jobs:

../../../_images/Used_CPU_Time_split_by_the_number_of_used_CPUs1.png

The number of used nodes as a function of the used CPU compute time also matches:

../../../_images/Used_CPU_Time_split_by_the_number_of_used_nodes1.png

In conclusion:

  • There were many small jobs: 61% of the used CPU compute time came from jobs running on a single node, and almost 50% from jobs using 28 or fewer cores.

  • There were many short jobs: 90% of the jobs ran for less than 14 hours, and 50% even for less than 4 hours.

Compared to 2023, jobs got bigger while the job duration decreased.

As the load on the cluster was not that high, the queuing time was short for the large majority of jobs (>90%):

Percentile   Total             Single core       Multicore         Single Node       Multi Node
----------   ---------------   ---------------   ---------------   ---------------   ---------------
50%          0 days 00:00:46   0 days 00:00:50   0 days 00:00:14   0 days 00:00:46   0 days 00:23:04
75%          0 days 00:27:59   0 days 00:07:12   0 days 04:07:00   0 days 00:24:37   0 days 08:52:24
80%          0 days 01:32:08   0 days 00:53:32   0 days 07:36:44   0 days 01:27:50   0 days 13:06:31
90%          0 days 05:06:58   0 days 03:43:43   0 days 20:03:17   0 days 04:58:31   1 day 16:03:45
95%          0 days 09:17:42   0 days 05:48:03   1 day 18:23:10    0 days 08:43:28   3 days 19:03:09
99%          1 day 19:29:37    0 days 12:08:49   5 days 04:22:09   1 day 12:20:00    6 days 10:52:09

The column Total shows the queuing time per percentile for all CPU jobs while the other columns show the queuing time for each type of job.

Queuing time depends on the overall load of the cluster and the resources requested by the job. The more resources requested, the higher the queuing time for that job. However, we see that 50% of all jobs start immediately, regardless of the resources requested. Even including multi-node jobs (the largest), 90% of all jobs start within 5 hours, 80% even within just 1.5 hours.
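
Pending jobs can also ask the scheduler for its current estimate of their start time. Assuming Slurm, the following shows the expected start time of your own pending jobs (estimates change as the queue evolves):

    # Show the scheduler's estimated start times for your pending jobs
    squeue -u "$USER" --start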

The previous queuing times are aggregated over the entire year, which flattens out very busy moments when users can experience much longer queuing times. The following graph splits out the queue times on a monthly basis:

../../../_images/Percentiles_for_queue_time_per_month_for_CPUs1.png

The month of October stands out due to its higher queuing times at the 80th and 90th percentiles. We found no external cause for this: cluster capacity was nominal, so it was simply a busier month than usual.

Thanks to the large quantity of short jobs and the lower activity during the weekend, the queuing time on Saturday and Sunday is significantly shorter than during the rest of the week (about 50% shorter).

If we look at the active users per month, we see no surprises: April and November are the months with the most people on the cluster:

../../../_images/Active_CPU_users_per_month1.png

Despite the higher number of users, the queuing time in April does not stand out.

Used compute time by faculty:

Faculty                                                  Usage
------------------------------------------------------   -----
Faculty of Social Sciences and Solvay Business School     0.3%
Department ICT                                            0.8%
Non-VUB                                                   0.9%
Faculty of Medicine and Pharmacy                          1.3%
Students                                                 14.8%
Faculty of Engineering                                   16.1%
Faculty of Sciences and Bioengineering Sciences          65.7%

We only show those faculties that use at least 0.1% of the total compute time of the year. As expected, the Faculties of Sciences and (Bio-)Engineering use the largest share of the CPU compute time.

The overview of used compute time per department reveals the actual use of the cluster per scientific domain:

Department                                      Usage
---------------------------------------------   -----
Electronics and Informatics                      0.1%
Supporting clinical sciences                     0.2%
Sociology                                        0.2%
Clinical sciences                                0.3%
Department of Bio-engineering Sciences           0.5%
Pharmaceutical and Pharmacological Sciences      0.7%
Administrative Information Processing            0.8%
Non-VUB                                          0.9%
Electrical Engineering and Power Electronics     1.0%
Applied Physics and Photonics                    1.1%
Engineering Technology                           1.2%
Department of Water and Climate                  2.4%
Biology                                          2.7%
Geography                                        3.1%
Applied Mechanics                                4.8%
Materials and Chemistry                          5.5%
Physics                                          5.6%
Informatics and Applied Informatics             14.4%
Students                                        14.8%
Chemistry                                       39.4%

Used compute time split over the different employment types of the users:

Employment Type                  Usage
------------------------------   -----
Guest professor                   0.5%
Non-VUB                           0.9%
Professor                         1.5%
Administrative/technical staff    6.0%
Students                         14.7%
Researcher                       76.3%

GPU Usage#

The load on the GPUs of the cluster was 75% on average, with some periods of low usage during the year and others where all GPUs were constantly in use.

../../../_images/xdmod_GPU_Usage_Hydra_2024.png

About 80% of the used GPU time comes from just 10% of the users, similar to what we see on the CPU side.

Split of used GPU time per VSC institute:

Institute    Usage
----------   -----
VUB          97.4%
KULeuven      2.5%
UGent         0.1%
UAntwerpen    0.0%

For the remainder of the analysis we only consider jobs that ran for at least one hour; shorter jobs are assumed to be tests or failed jobs. These jobs represent 96.5% of the used GPU time.

Profile of a typical GPU job split in percentiles of the number of jobs:

Percentile   GPUs   Nodes   Walltime
----------   ----   -----   ----------------
50%             1       1   0 days 05:11:24
75%             1       1   0 days 10:39:58
80%             1       1   0 days 13:05:19
90%             1       1   1 day 00:00:09
95%             1       1   1 day 12:00:13
99%             2       1   4 days 04:20:02
99.9%           3       3   5 days 00:00:19

Note

A typical GPU job (90% of all jobs) uses a single GPU and runs for less than 1 day.
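
As a hedged illustration, a Slurm batch script requesting the resources of such a typical GPU job could look as follows; the job name, module and script are placeholders, and depending on the cluster setup a specific GPU partition may have to be requested as well:

    #!/bin/bash
    #SBATCH --job-name=typical-gpu-job
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4      # example value: CPU cores feeding the GPU
    #SBATCH --gpus=1               # typical job: a single GPU
    #SBATCH --time=24:00:00        # typical job: done within 1 day

    module load SomeMLStack/1.0    # placeholder: load your own modules
    srun python train_model.py     # placeholder: your actual workload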

If we look at the number of used GPUs as a function of the used GPU time of those jobs, we see that this matches the previous percentiles based on the number of jobs:

../../../_images/Used_GPU_Time_split_by_the_number_of_used_GPUs1.png

In conclusion:

  • The large majority of the used GPU time (91%) comes from jobs using a single GPU

  • There are many short GPU jobs: 80% of the jobs ran for less than 13 hours, 50% even for less than 5.2 hours

Compared to 2023, the dominance of single-GPU jobs has increased while the job duration has decreased.

The GPU resources in Hydra are much smaller in number than its CPU resources, which is reflected in the longer queuing times observed for GPU jobs. Moreover, we see a general uptake of AI/ML techniques and thus an increasing demand for GPU resources.

Percentile   Total              Single GPU         Multi GPU
----------   ----------------   ----------------   ----------------
50%          0 days 03:16:24    0 days 03:23:54    0 days 00:15:46
75%          0 days 11:43:20    0 days 11:54:03    0 days 05:59:34
80%          0 days 14:40:07    0 days 14:47:51    0 days 08:52:30
90%          0 days 23:56:34    0 days 23:59:24    0 days 19:09:27
95%          1 day 10:24:31     1 day 10:28:22     1 day 07:31:23
99%          3 days 00:40:18    3 days 03:53:17    2 days 10:01:51
99.9%        14 days 05:49:23   14 days 06:19:27   5 days 00:17:31

The column Total shows the queuing time per percentile for all GPU jobs while the other columns show the queuing time for each type of job.

The load on the GPU nodes was a lot higher compared to 2023, and this translates into longer queue times. Nonetheless, 50% of GPU jobs start within a couple of hours of submission and 90% start within a day.

The previous queuing times are aggregated over the entire year, which flattens out very busy moments when users can experience much longer queuing times. The following graph splits out the queue times on a monthly basis:

../../../_images/Percentiles_for_queue_time_per_month_for_GPUs1.png

As expected, the queuing time is higher during the busy periods on the cluster, but the median (50th percentile) queue time remains low.

As in 2023, and unlike for CPU jobs, we do not see a clear weekend effect in the queuing times of GPU jobs.

Looking at the active users per month, we see that the second half of the year had fewer active GPU users:

../../../_images/Active_GPU_users_per_month1.png

Used GPU time by faculty:

Faculty                                                  Usage
------------------------------------------------------   -----
Department ICT                                            1.5%
Non-VUB                                                   2.7%
Faculty of Medicine and Pharmacy                          7.7%
Faculty of Social Sciences and Solvay Business School    15.9%
Faculty of Engineering                                   19.0%
Students                                                 22.0%
Faculty of Sciences and Bioengineering Sciences          31.3%

We only show the faculties that used at least 0.1% of the total GPU time. Here we can clearly see the growing importance of AI/ML in all domains of science: the medical and social sciences are small users of the CPU resources but big users of the GPU resources, as in 2023.

The overview of used compute time per department reveals the actual use of the GPUs per scientific domain:

Department                                      Usage
---------------------------------------------   -----
Chemistry                                        0.1%
Basic (bio-) Medical Sciences                    0.1%
Geography                                        0.1%
Applied Physics and Photonics                    0.4%
WIDSWE                                           0.8%
Department of Water and Climate                  1.0%
Electrical Engineering and Power Electronics     1.2%
Administrative Information Processing            1.5%
Engineering Technology                           1.7%
Non-VUB                                          2.7%
Pharmaceutical and Pharmacological Sciences      7.5%
Department of Bio-engineering Sciences          13.8%
Electronics and Informatics                     14.7%
Business technology and Operations              15.9%
Informatics and Applied Informatics             16.5%
Students                                        22.0%

Used GPU time split over the different employment types of the users:

Employment Type                  Usage
------------------------------   -----
Professor                         0.2%
Administrative/technical staff    1.5%
Non-VUB                           2.7%
Students                         21.9%
Researcher                       73.8%

Software Usage#

In 2024, there were 5,694 modules available across all partitions of Hydra, representing 1,137 unique software installations and 3,482 unique extensions. The following chart shows the software most used by our users, based on modules loaded directly in the terminal or in their jobs.

../../../_images/top10_module_loads.png

Unsurprisingly, Python rules: the top 5 software modules are all packages in the Python ecosystem.
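
For reference, modules are searched and loaded as follows, assuming the Lmod module system used on Hydra (the package name is just an example):

    # Search all partitions for available versions of a package
    module spider Python
    # Load a module; without a version, the default version is loaded
    module load Python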

Notebook Usage#

The HPC notebook platform launched in 2023 consolidated its position in 2024 as a major portal for our users to access Hydra.

In total, 5,509 sessions were started in 2024 on notebooks.hpc.vub.be. This is equivalent to roughly 15 users on the notebook platform every single day, double what we saw in 2023.

Usage of the Jupyter environment:

Notebook Environment                            Sessions   Percentage
---------------------------------------------   --------   ----------
Default: minimal with all modules available         4012        72.8%
DataScience: SciPy-bundle + matplotlib + dask       1083        19.7%
RStudio with R                                       206         3.7%
MATLAB                                               119         2.2%
Molecules: DataScience + nglview + 3Dmol              89         1.6%

The stars of the notebook platform are the tools for Python, as we also see in the usage of regular software modules. The DataScience environment gathers a fifth of all sessions, while most users prefer launching the Default environment and manually loading their own software modules.

User support#

We received 814 support requests (incidents) from users, a substantial increase compared to the 714 support requests of 2023. The distribution per month follows the academic year and the load on the system:

../../../_images/SDC_tickets_per_month_2024.png

Distribution per service provided by the SDC team:

Business service                        Incidents   Percentage
-------------------------------------   ---------   ----------
HPC Scientific software installation          215        27.0%
HPC jobs troubleshooting                      166        20.8%
HPC data                                      116        14.6%
HPC consultancy                                96        12.0%
Pixiu scientific research data                 91        11.4%
HPC VSC accounts & access                      80        10.0%
HPC Tier-0 & Tier-1 projects advice            32         4.0%
HPC Tier-2 hardware co-investment               1         0.1%

The majority of incidents (~48%) are related to problems with jobs or requests for new software installations.

We managed to resolve the large majority of the incidents within one working day:

../../../_images/SDC_duration_incidents_2025.png

Tier-1#

There were 3 calls for projects in 2024 for the VSC Tier-1 (Hortense). In total, VUB researchers requested 24 starting grants and submitted 11 full project proposals to those 3 calls: 9 of those proposals were accepted, which translates into an 82% success rate.

Usage of CPU Hortense in 2024:

../../../_images/xdmod_Hortense__CPU_Usage.png

VUB researchers used about 9.9% of the academic CPU compute time on the Tier-1.

Usage of GPU Hortense in 2024:

../../../_images/xdmod_Hortense__GPU_Usage.png

VUB researchers used about 6.2% of the academic GPU compute time on the Tier-1.

VSC User survey#

At the end of 2024 the second edition of the VSC-wide user survey was run. This survey encompasses all VSC services (all Tier-1 components and all Tier-2 systems).

See also

The results of the 2023 survey can be found in Overview of 2023.

The invite for the survey was sent out on 5 November 2024 and the survey closed on 17 December 2024. In total, 555 people responded and 411 of them completed the survey in full, a 10% increase compared to 2023. Among the respondents, 54 people were affiliated with the VUB (much appreciated!).

../../../_images/VUB-survey1.png

Looking at the answers of the respondents who said they were using VUB’s Tier-2 (Hydra), we get the following percentages of people who rated each aspect of the service as Excellent or Good (discarding No Experience answers):

Service aspect                   Excellent or Good
------------------------------   -----------------
Quality                          98.1% (51/52)
Response time                    98.0% (49/50)
Correspondence to expectations   88.7% (47/53)
Software installations           92.9% (39/42)
Availability                     92.6% (50/54)
Maintenance communication        94.0% (47/50)
Documentation                    89.1% (49/55)
Getting started                  83.6% (46/55)

In general we are happy with this result, as it shows that our services are very positively valued by our users. Like last year, Getting started and Documentation were rated lowest.

Regarding the documentation, we started a redesign of the VSC docs website in 2023, and in 2024 we continued integrating our own documentation into it. The goal is to gather all the documentation related to our services in a single place.

Note

We are open to feedback on how we can improve the documentation. There is a constant effort to keep the documentation up to date and complete, so if you find anything wrong or have suggestions, please contact VUB-HPC Support.

Getting started on Hydra is not easy, as there is quite a learning curve. With training and documentation we try to flatten that curve as much as possible. We also hope that the new Open OnDemand portal will improve the situation: our plan is for this portal to become the default way of accessing the cluster.

In the free comments section of the survey we found the following three main suggestions and requests:

  • More GPUs

  • Queuing time is perceived as too long

  • Add an Open OnDemand platform

We are aware that the GPUs are in high demand and that their queuing times can at times be longer than tolerable. We are actively encouraging the biggest GPU users to move to the Tier-1, and we are working on buying more GPUs for Hydra.

Regarding Open OnDemand, in 2025 we launched our new VUB portal based on Open OnDemand: New OnDemand web portal for HPC.

Regarding advanced or application-specific training: if you have any ideas on topics for such a training, please let us know at VUB-HPC Support.

Finally, we want to close this report with several very positive comments that we received spontaneously in the survey:

“I am very impressed by how the HPC team listens to the users’ requirements. They were very quick to implement a test partition for GPU jobs which can enhances the workflow.”

“The user manual is very detailed and highly usable.”

“Sometimes the queue time is very long. However i want to emphasize that the HPC-team at the VUB are super helpfull. They are always nice and take their time to explain things to me.”

“They are always very friendly and helpful, very nice to work with them!”