As of today, over 229,300 jobs have successfully run through the Slurm scheduler. Given the stability and flexibility of the new scheduler, we are consolidating the Condor system into Slurm. This means that jobs you previously submitted to Condor will now be submitted to an HPC partition named Condor.
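For reference, targeting a named partition in Slurm only requires a partition directive in your batch script. The sketch below is a minimal example; the exact partition name spelling, job name, and resource values are assumptions and may differ from your setup:

    #!/bin/bash
    #SBATCH --job-name=condor_test      # descriptive job name (assumed)
    #SBATCH --partition=condor          # target the Condor partition (name assumed)
    #SBATCH --ntasks=1                  # single task
    #SBATCH --time=01:00:00             # one-hour wall-clock limit (assumed)

    # Replace with your actual executable
    ./my_program

Submit the script with "sbatch myscript.sh" and check its status with "squeue -u $USER".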
Our Systems Team has been working hard to restore the Lustre storage service. As of 4:45pm today, the Lustre system is online but in recovery mode. It is currently accessible from the Spear nodes, but not yet from the export nodes.
This means that Spear is now online. Lustre access from the HPC cluster or from other systems is not yet functional, though. We will keep you posted on our progress.
Recently, we noticed a substantial number of nodes crashing and causing job failures. We have been investigating this issue and have determined that the problem is memory exhaustion: jobs have been filling all available RAM and swap space on the compute nodes. Under Moab and RHEL 6.5, this issue did not show up, since offending jobs would be killed by the Linux kernel (the out-of-memory killer). With the new scheduler, these jobs instead cause compute nodes to crash and reboot, and any job running on those nodes fails without any meaningful error message sent to users ("node failure").
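Until a permanent fix is in place, one common way to keep a job from exhausting a node's memory is to request an explicit memory limit in your batch script so Slurm can account for and enforce it. This is a minimal sketch using standard Slurm directives; the memory and time values are illustrative assumptions, not site policy:

    #!/bin/bash
    #SBATCH --job-name=mem_bounded_job   # descriptive job name (assumed)
    #SBATCH --mem=8G                     # memory requested for this job (value assumed)
    #SBATCH --ntasks=1                   # single task
    #SBATCH --time=04:00:00              # wall-clock limit (value assumed)

    # Replace with your actual executable
    ./my_memory_heavy_program

Depending on how enforcement is configured, Slurm can then terminate a job that exceeds its requested memory, which produces a much clearer error than a silent "node failure".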