Recently, we noticed a substantial number of nodes crashing, causing job failures. We have been investigating and have determined that the problem is memory exhaustion: jobs have been filling up all available RAM and swap. Under Moab and RHEL 6.5 this issue did not surface, because the Linux kernel's out-of-memory (OOM) killer would terminate the offending jobs. Under the new scheduler, these jobs instead cause compute nodes to crash and reboot, and any job running on those nodes fails without a meaningful error message to users ("node failure").
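In the meantime, the most reliable user-side workaround is to request an explicit memory limit so the scheduler can cap a job before it exhausts a node. Below is a minimal sketch of a Slurm batch script; the job name, task count, and sizes are illustrative, and whether the limit is actively enforced depends on the cluster's Slurm configuration:

    #!/bin/bash
    #SBATCH --job-name=myjob        # illustrative job name
    #SBATCH --ntasks=16             # example task count
    #SBATCH --mem=32G               # request (and cap) 32 GB of memory per node
    #SBATCH --time=04:00:00         # wall-clock limit

    srun ./my_program               # placeholder for your actual executable

If a job is killed for exceeding its limit, raise --mem rather than omitting it; a job with no memory request is exactly the kind that can take down a node.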
UPDATE 7pm: The HPC issues are resolved. Thanks for your patience.
We are currently experiencing an issue on the HPC where Slurm commands are not responding. Our Systems Team is working to restore the service, and we will post an update as soon as the issue is resolved.
UPDATE: Fri, Jul 24 - 9:00pm - We have finished compiling and redeploying the new version of OpenMPI. All systems are now running OpenMPI v1.8.7.
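To confirm that your environment picks up the new build, you can query the version from a login or compute node (this assumes mpirun is on your PATH, e.g. via your usual module setup; the output shown is illustrative):

    $ mpirun --version
    mpirun (Open MPI) 1.8.7

If this still reports 1.8.6, log in again or reload your environment so the updated installation is picked up.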
We have just been notified of a major memory leak in OpenMPI v1.8.6 (the current version on the HPC). This is likely why many nodes have been crashing and disrupting jobs on the HPC this week.