NEMO Network Problems
Published: 11 May 2023
by HPC Team Freiburg
A damaged network switch caused 88 nodes to fail today. The affected compute nodes were rebooted, and the jobs running on them were terminated. The problem has since been fixed.
UPDATE (2 June 2023): Another damaged switch caused 44 nodes to crash today. The problem has been fixed.
After so many years of operation, failures like these are to be expected, so over the next few weeks we will try to check the remaining switches in the cluster and replace them preemptively.