Big upgrade for Hydra on 27-28 November 2023


We are happy to announce a big upgrade to our beloved tier-2 HPC cluster at VUB. On Monday, November 27th 2023 at 03:00 (CET), Hydra will be shut down for 48 hours to apply a major system upgrade, renew the scratch storage and add some extra compute nodes.

Users will not be able to connect to Hydra in any way during the operation. Your jobs will not start if they cannot end before November 27th at 03:00 (CET). Jobs in the queue will stay queued, and job scheduling will resume automatically after the upgrade.
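If you want to check whether your queued jobs are expected to start before the shutdown, Slurm can report their estimated start times. The sketch below is a minimal example; job_script.sh is a placeholder for your own job script:

    # Show the estimated start time of your pending jobs
    squeue -u $USER --start

    # A job whose requested walltime extends past the shutdown will stay pending;
    # requesting a shorter --time may let it fit in the remaining window
    sbatch --time=24:00:00 job_script.sh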

Requests for new software installations in Hydra will be frozen on November 6th and resumed once this upgrade operation is completed. Those requests will then be carried out on the new operating system.

Operating system and software

The OS of the cluster will be upgraded from the current CentOS Linux 7.9 to Rocky Linux 8.8. The system kernel and all core libraries will receive major upgrades to ensure that the cluster keeps its high performance and security standards.
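If you want to verify which operating system a node is running, a quick check from the command line is enough; the expected output lines below are indicative:

    # Print the distribution name and version of the node you are logged into
    grep PRETTY_NAME /etc/os-release
    # Before the upgrade: PRETTY_NAME="CentOS Linux 7 (Core)"
    # After the upgrade:  PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"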

This major upgrade will, however, be a breaking change for software installations in Hydra. All software modules currently available in the toolchains foss/2022a, intel/2022a and GCCcore/11.3.0 will be fully re-installed on the new system, and you will be able to use those software modules as usual.

Older software will be kept on the system as-is and will be loadable through a special module. Unfortunately, we cannot guarantee its performance or functionality. If you must continue using older software on the cluster, we will carry out re-installations on demand; please contact VUB-HPC Support with your request.
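As an illustration, re-installed software on the retained toolchains can be listed and loaded exactly as before; the module names below are only examples, and the name of the special module for older software will be announced separately:

    # List the modules available for one of the retained toolchains
    module avail foss/2022a

    # Load a re-installed module as usual
    module load SciPy-bundle/2022.05-foss-2022a

    # Older software kept from the previous OS will sit behind a special module;
    # the name below is only a placeholder
    module load legacy-software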

Job scheduler

The version of Slurm in Hydra will be upgraded to its latest release, which brings many fixes and improvements. This update does not introduce any major changes for users: it will not affect the usage of the Slurm command line tools (e.g. sbatch, srun, scancel…) or the #SBATCH directives in job scripts.

However, with this operation we will remove the compatibility layer with PBS/Torque, the previous job scheduler of Hydra. It has been 2 years since the migration to Slurm and we have observed that the vast majority of users are using native Slurm commands and job scripts, so supporting the old commands is no longer justified. If you still rely on the old qsub and qstat commands in Hydra, or on #PBS directives in your job scripts, be sure to update your workflow before the date of this migration. You can find all the information in our documentation at Torque/Moab to Slurm migration, and if you need help with the conversion, please contact VUB-HPC Support.
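As a quick reference, the sketch below shows a typical conversion of old Torque/PBS directives to their native Slurm counterparts; the job name, resources and program are placeholders, and the full mapping is described in the migration documentation:

    #!/bin/bash
    # Old Torque/PBS directives (no longer accepted after the upgrade):
    #   #PBS -N my_job
    #   #PBS -l nodes=1:ppn=4
    #   #PBS -l walltime=24:00:00
    # Equivalent native Slurm directives:
    #SBATCH --job-name=my_job
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4
    #SBATCH --time=24:00:00

    # Submit with sbatch instead of qsub, inspect the queue with squeue instead of qstat
    srun ./my_program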

New scratch storage

Hydra is getting a brand new storage system for VSC_SCRATCH! The current scratch storage has reached the end of its expected lifetime after years of excellent performance and reliability, and it is time for a renewal. The new scratch will be connected to the cluster with InfiniBand EDR and 100 Gbit/s Ethernet, two very fast network connections that ensure the best possible storage speed on all compute partitions.

The main novelty of the new scratch is that the storage will be composed of 2 types of drives:

  • 592 TB of storage on HDDs in a redundant RAID6 configuration: slightly slower than the current scratch but with more storage space

  • 16 TB of storage on very fast NVMe SSDs in a redundant RAID6 configuration: much faster than the current scratch. This SSD layer will act as a kind of cache for file operations on VSC_SCRATCH, and jobs will automatically benefit from it whenever new files are generated or transferred into the scratch (see the example job script below)
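As an example, a job that stages its data on VSC_SCRATCH will use the new storage transparently; the paths and program below are placeholders:

    #!/bin/bash
    #SBATCH --job-name=scratch_example
    #SBATCH --time=01:00:00

    # Stage input data onto the scratch file system
    cp $VSC_DATA/input.dat $VSC_SCRATCH/
    cd $VSC_SCRATCH

    # New files written here automatically benefit from the NVMe cache layer,
    # no changes to the job script are needed
    srun ./my_program input.dat

    # Copy the results back to permanent storage
    cp output.dat $VSC_DATA/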

New compute nodes

It is time to say goodbye to the old Ivy Bridge nodes in the cluster. The hardware of these nodes is no longer up to standard in terms of energy efficiency, and supporting software on their old CPU microarchitecture has become increasingly difficult. We are replacing them with Skylake nodes recycled from the recently decommissioned tier-1 cluster Breniac. These nodes have 28 CPU cores, 192 GB of memory and a fast InfiniBand network, and they will be incorporated into the existing partition skylake_mpi.
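If you want to target these nodes explicitly, you can request the partition in your job script as usual; the resource values below are only an example:

    #!/bin/bash
    #SBATCH --partition=skylake_mpi
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=28   # the refurbished nodes have 28 cores each
    #SBATCH --time=12:00:00

    srun ./my_mpi_program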

New sister cluster

Our notebook platform has been very well received and its usage keeps growing. We have observed that more and more users are interested in running notebooks on GPUs, and that they struggle to get a GPU session because the Nvidia Pascal GPUs available on the notebook platform are limited in number. Therefore, we are adding a new sister cluster to Hydra, called Anansi, which is specifically tailored for interactive workflows on the notebook platform.

This new cluster will first be available in a pilot phase and then added to the available options on the notebook platform. The compute resources of Anansi are shared between users to guarantee a quick start of notebook sessions. It has 4 GPUs that are virtually partitioned into 16 smaller GPUs, which are ideal for visualization workloads. You will be able to select Anansi from the main interface of the notebook platform, alongside the existing options to run on dedicated resources in Hydra, such as the Skylake nodes or the Nvidia Pascal GPUs.
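Once your notebook session runs on Anansi, you should be able to inspect the GPU slice assigned to it from a terminal inside the session, assuming the slices are exposed as regular CUDA devices:

    # List the GPU (or GPU slice) visible to your notebook session
    nvidia-smi -L

    # Show the utilisation and memory of the allocated slice
    nvidia-smi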

We will share more information about Anansi once the maintenance operation is completed. Stay tuned.