Hydra upgrade 27-28 November finished
Hydra has received a major system upgrade, a brand-new scratch storage system, extra compute nodes, and a new GPU cluster for interactive workflows.
As detailed below, this big upgrade comes with some important changes for our users.
Operating system and software
The OS of the cluster has been upgraded from CentOS Linux 7.9 to Rocky Linux 8.8. The system kernel and all core libraries have received major upgrades to ensure that the cluster keeps meeting its high standards of performance and security.
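If you want to check which OS release a node is running, you can query its release information; this should work the same way on login and compute nodes:

```bash
# Show the OS name and version of the node you are logged in to
cat /etc/os-release

# Show the running kernel version
uname -r
```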
Importantly, this major upgrade is a breaking point for software installations in Hydra. All software modules currently available in the foss/2022a and intel/2022a toolchains (including GCCcore/11.3.0) have been fully re-installed on the new system, and you will be able to use those software modules as usual. Please contact VUB-HPC Support if any software you need from this generation is missing or not working as expected.
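For example, checking and loading the re-installed modules works as before; the module names below are just illustrations based on the toolchains mentioned above:

```bash
# List the modules that were re-built for the 2022a generation
module avail foss/2022a

# Load the foss/2022a toolchain (GCC, OpenMPI, OpenBLAS, ...)
module load foss/2022a
```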
Older software will be kept on the system as-is and will be loadable through the legacy-software module. This module will be loaded by default on the compute nodes (but not on the login nodes) until the end of the year. Although many older software installations will continue to work, we cannot guarantee their performance or functionality. We therefore strongly suggest switching to the 2022a software generation or newer. If you must continue using older software that is no longer working, we will provide a CentOS 7 container image or, as a last resort, carry out a re-installation; please contact VUB-HPC Support for help.
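As a minimal sketch of what this looks like in practice, the legacy-software module can also be loaded manually, for instance on a login node:

```bash
# Make the pre-upgrade (CentOS 7 era) software installations visible again
module load legacy-software

# The older modules now show up alongside the re-installed ones
module avail
```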
Job scheduler
The Slurm job scheduler in Hydra has been upgraded to a more recent version, bringing many fixes and improvements. Luckily, this update does not introduce any major changes for users and it will not affect the usage of the Slurm command line tools (e.g. sbatch, srun, scancel, …) or the #SBATCH directives in jobs.
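In other words, existing job scripts keep working unchanged. A minimal example for reference (the resource values and program name are purely illustrative):

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Run the application with the resources requested above
srun ./my_program
```

Submitting with sbatch and cancelling with scancel work exactly as before.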
As part of the Slurm upgrade, we have deprecated the PBS/Torque compatibility layer in Hydra. Users who still need the old qsub and qstat commands and #PBS directives in their job scripts should first load the slurm-torque-wrappers module in their environment. However, we urge all users to make use of the native Slurm commands from now on. All info is available at Torque/Moab to Slurm migration, and we can also help you with the conversion if needed. Please contact VUB-HPC Support.
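For instance, a job that is still submitted with the PBS wrappers could be handled as follows (the script name is just an example):

```bash
# Temporary workaround: make qsub/qstat and #PBS directives available
module load slurm-torque-wrappers
qsub job_script.sh

# Recommended: switch to the native Slurm commands instead
sbatch job_script.sh
squeue -u $USER
```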
New scratch storage
Hydra has been upgraded with a brand new scratch storage system (VSC_SCRATCH). The old scratch has served us well for many years with excellent performance and reliability. The new scratch is connected to the cluster with InfiniBand EDR and 100 Gbit/s Ethernet, both very fast network connections that ensure the best possible speed of the storage on all compute nodes.
The main novelty of the new scratch is that the storage is composed of 2 types of drives:

- 592 TB of storage on HDDs in a redundant RAID6 configuration: slightly slower than the old scratch, but with more storage space.
- 16 TB of storage on very fast NVMe SSDs in a redundant RAID6 configuration: much faster than the old scratch. This SSD storage works as a kind of cache for file operations on VSC_SCRATCH, and jobs will automatically benefit from it whenever new files are generated or transferred to the scratch.
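Nothing changes in how the scratch is used in job scripts. A small illustration (the file and program names are hypothetical):

```bash
# Work on the new scratch file system inside a job
cd $VSC_SCRATCH

# New files written here automatically benefit from the NVMe SSD cache
cp $VSC_DATA/input.dat .
./my_analysis input.dat > output.dat
```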
New compute nodes
We’re saying goodbye to the very old Ivybridge nodes in the cluster. We’ve replaced them with 16 extra Skylake nodes that were recycled from the recently decommissioned Tier-1 cluster, Breniac. These new nodes have 28 CPU cores, 192 GB of memory and a fast InfiniBand network. They have been incorporated into the existing skylake_mpi partition.
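To explicitly target these nodes, a job can request the skylake_mpi partition; the resource values and program name below are only illustrative:

```bash
#!/bin/bash
#SBATCH --partition=skylake_mpi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --time=02:00:00

# Launch an MPI application across the allocated Skylake nodes
srun ./my_mpi_program
```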
New sister cluster
Our notebook platform has been very well received and its usage keeps growing. We have observed that more and more users are interested in running notebooks on GPUs, but that they are experiencing difficulties because the Nvidia Pascal GPUs available on the notebook platform are very busy. Therefore, we are adding a new sister cluster to Hydra, called Anansi, that is tailored for interactive GPU-based workflows on the notebook platform.
This new cluster will very soon be available in a pilot phase, and will later be added to the available options on the notebook platform. We will soon share more information about Anansi. Stay tuned.
Enhanced security
From now on, users can no longer directly use ssh to log into a node where their job is running. The only way to do this is via srun, as explained in the FAQ on monitoring jobs. Using srun is more secure, as access is limited to the resources allocated to the job.
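As a minimal sketch of that workflow (the job ID is a placeholder; depending on the Slurm version, the --overlap option may be needed to share resources with running job steps):

```bash
# Open an interactive shell on a node of your running job 123456
srun --jobid=123456 --overlap --pty bash
```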