Hydra upgrade 27-28 November finished#


Hydra has received a major system upgrade, a brand-new scratch storage system, extra compute nodes, and a new GPU cluster for interactive workflows.

As detailed below, this big upgrade comes with some important changes for our users.

Operating system and software#

The OS of the cluster has been upgraded from CentOS Linux 7.9 to Rocky Linux 8.8. The system kernel and all core libraries have received major upgrades to ensure that the cluster maintains its high performance and security standards.

Importantly, this major upgrade is a breaking point for software installations on Hydra. All software modules currently available in the foss/2022a and intel/2022a toolchains (including GCCcore/11.3.0) have been fully re-installed on the new system, and you will be able to use those software modules as usual. Please contact VUB-HPC Support if any software you need from this generation is missing or not working as expected.
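
For instance, modules from the 2022a generation can be checked and loaded exactly as before. The sketch below assumes the usual Lmod commands and uses foss/2022a only as an example:

    # check that a re-installed 2022a toolchain is available
    module spider foss/2022a

    # load it as usual
    module load foss/2022a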

Older software will be kept on the system as-is and will be loadable through the legacy-software module. This module will be loaded by default on the compute nodes (but not on the login nodes) until the end of the year. Although many older software installations will continue to work, we cannot guarantee their performance or functionality. We therefore strongly suggest switching to the 2022a software generation or newer. If you must continue using older software that no longer works, we can provide a CentOS 7 container image or, as a last resort, carry out a re-installation; please contact VUB-HPC Support for help.
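
On a login node, or after the end of the year, a minimal sketch to make the pre-upgrade modules visible again looks like this:

    # re-enable the pre-upgrade module tree
    module load legacy-software

    # the older modules now show up in the usual listing
    module avail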

Job scheduler#

The Slurm job scheduler on Hydra has been upgraded to a more recent version, bringing many fixes and improvements. Luckily, this update does not introduce any major changes for users, and it does not affect the usage of the Slurm command line tools (e.g. sbatch, srun, scancel…) or the #SBATCH directives in jobs.
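
In practice, an existing job script with native Slurm directives, like the hypothetical example below, keeps working unchanged and is submitted with sbatch as before:

    #!/bin/bash
    #SBATCH --job-name=my_job        # placeholder job name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --time=01:00:00

    module load foss/2022a           # example toolchain from the re-installed 2022a generation
    srun ./my_program                # my_program stands in for your own executable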

As part of the Slurm upgrade, we have deprecated the PBS/torque compatibility layer on Hydra. Users who still need the old qsub and qstat commands and the #PBS directives in their job scripts should first load the slurm-torque-wrappers module in their environment. However, we urge all users to switch to the native Slurm commands from now on. All info is available at Torque/Moab to Slurm migration, and we can also help you with the conversion if needed. Please contact VUB-HPC Support.
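
As a sketch, the two options look like this (job.sh is a placeholder script name):

    # temporary option: keep using the PBS/torque wrappers
    module load slurm-torque-wrappers
    qsub job.sh
    qstat

    # recommended option: switch to the native Slurm commands
    sbatch job.sh
    squeue -u $USER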

New scratch storage#

Hydra has been upgraded with a brand-new scratch storage system (VSC_SCRATCH). The old scratch has served us well for many years with excellent performance and reliability. The new scratch is connected to the cluster with InfiniBand EDR and 100 Gbit/s Ethernet, both very fast network connections that ensure the best possible storage performance on all compute nodes.

The main novelty of the new scratch is that the storage is composed of 2 types of drives:

  • 592 TB of storage on HDDs in a redundant RAID6 configuration: slightly slower than the old scratch, but with more storage space.

  • 16 TB of storage on very fast NVMe SSDs in a redundant RAID6 configuration: much faster than the old scratch. This SSD storage works as a kind of cache for file operations on VSC_SCRATCH, and jobs will automatically benefit from it whenever new files are generated on or transferred to the scratch (see the sketch after this list).
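
As an illustration, jobs can keep using $VSC_SCRATCH exactly as before; newly written files automatically pass through the SSD cache. The paths and executable name below are hypothetical:

    # stage input data on the new scratch; new files benefit from the NVMe cache
    cp -r $VSC_HOME/my_dataset $VSC_SCRATCH/

    # run the job from the scratch directory
    cd $VSC_SCRATCH/my_dataset
    srun ./my_analysis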

New compute nodes#

We’re saying goodbye to the very old Ivy Bridge nodes in the cluster. We’ve replaced them with 16 extra Skylake nodes that were recycled from the recently decommissioned Tier-1 cluster, Breniac. Each of these nodes has 28 CPU cores, 192 GB of memory and a fast InfiniBand network. They have been incorporated into the existing skylake_mpi partition.
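
If you want to target these nodes explicitly, request the skylake_mpi partition in your job script; the resource values below are only an example:

    #SBATCH --partition=skylake_mpi   # partition that now also contains the recycled Breniac nodes
    #SBATCH --ntasks=28               # example: one full 28-core node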

New sister cluster#

Our notebook platform has been very well received and its usage keeps growing. We have noticed that more and more users are interested in running notebooks on GPUs, and that they experience difficulties because the Nvidia Pascal GPUs available on the notebook platform are very busy. Therefore, we are adding a new sister cluster to Hydra, called Anansi, which is tailored for interactive GPU-based workflows on the notebook platform.

This new cluster will soon become available in a pilot phase and will later be added to the available options on the notebook platform. We will share more information about Anansi shortly. Stay tuned.

Enhanced security#

From now on, users can no longer directly use ssh to log into a node where their job is running. The only way to do this is via srun, as explained in the FAQ on monitoring jobs. Using srun is more secure, as access is limited to the resources allocated to the job.
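
For example, to open a shell on the node(s) of a running job, attach to the job's allocation with srun; 12345 is a placeholder job ID, and on recent Slurm versions the --overlap option may be needed so the interactive step does not wait for free resources:

    # open an interactive shell inside the allocation of running job 12345
    srun --jobid=12345 --overlap --pty bash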