UPDATE - Wed, Nov 28 - 12:30pm - This issue is now RESOLVED. We appreciate your patience.
Our vendor rush-shipped replacement parts, and our systems team installed them today. All affected nodes and partitions are running at full capacity again.
In addition, we have added automated checks to our monitoring system that notify staff if this issue recurs. These checks also ensure that nodes experiencing network issues do not accept any jobs.
We are having issues with an InfiniBand (IB) switch that are affecting approximately 20 nodes in the HPC cluster. Affected partitions include eoas_q, genacc_q, backfill, backfill2 and a few others.
We have taken the broken nodes offline, so any new jobs that you submit should run without error, but they may take slightly longer to start.
Jobs that were affected by this typically fail with error messages similar to the following:
slurmstepd: error: *** 319418.0 ON hpc-***-1-1 CANCELLED AT 2018-11-25T13:52:32 ***
[hpc-***-1-1] too many retries sending message to 0x005e:0x0000328a, giving up
srun error: hpc-d32-1-2: task 3: Killed
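If you are unsure whether one of your jobs was affected, you could search its output for these messages. The following Python sketch is only an illustration and is not a tool we provide: the patterns are derived from the sample messages above, real messages may differ in wording, and the function name find_switch_errors is made up for this example.

```python
import re

# Patterns derived from the sample error messages in this notice.
# Assumption: actual messages on your job may vary in wording.
SIGNATURES = [
    re.compile(r"slurmstepd: error: \*\*\* .* CANCELLED AT "),
    re.compile(r"too many retries sending message to "
               r"0x[0-9a-fA-F]+:0x[0-9a-fA-F]+, giving up"),
    re.compile(r"srun:? error: \S+: task \d+: Killed"),
]


def find_switch_errors(text):
    """Return the lines of text that match any of the known signatures."""
    return [
        line
        for line in text.splitlines()
        if any(pattern.search(line) for pattern in SIGNATURES)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python check_job.py <job output file>
    # (the script name and file argument are placeholders)
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for hit in find_switch_errors(handle.read()):
                print(f"{path}: {hit}")
```

A match does not prove that the switch caused the failure, since a job can be cancelled or killed for other reasons. If a job fails with these messages and you are unsure why, resubmitting it is a reasonable first step.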
We are aware of the issue and are working on a resolution. Thank you for your patience. We will post another notice as soon as we have a resolution or an update.