Over the past few days, we've noticed a large number of jobs failing on the HPC when they are assigned to a certain set of nodes.
The affected nodes are those in the D30 and D31 racks, and possibly also the D32 rack (hostnames hpc-d30..., hpc-d31..., hpc-d32...).
Errors typically look like the following:
    [HPC] Your job 1234567 is in state NODE_FAIL
    Your HPC Job 1234567 entered the following state: NODE_FAIL
You may also see errors like this:
    srun: error: Node failure on hpc-d32-1-3
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** JOB 123457 ON hpc-d32-1-1 CANCELLED AT 2018-06-18T09:35:12 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
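In the meantime, jobs can often be steered away from the affected racks at submission time. The sketch below assumes Slurm's standard --exclude option; the bracketed node ranges are illustrative only (the real hostnames follow the hpc-d3x-... pattern above, and `sinfo -N` will list the actual node names on your cluster):

```shell
# Interim workaround sketch: keep new jobs off the affected racks using
# Slurm's --exclude flag. The node ranges below are ASSUMED for
# illustration -- replace them with the real node names from `sinfo -N`.
EXCLUDE="hpc-d30-[1-2]-[1-8],hpc-d31-[1-2]-[1-8],hpc-d32-[1-2]-[1-8]"

# Example submission command (printed here rather than executed):
echo "sbatch --exclude=$EXCLUDE my_job.sh"
```

Note that excluding D32 as well is the conservative choice while its status is unconfirmed; once the faulty racks are back in service, dropping the flag restores the full node pool.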
We are working to identify the root cause and prepare a fix, and will post further updates as soon as we know more.