Slurm Controller Issues

UPDATE (9:15am - Tue, July 25) - The Slurm issues are resolved.  We will continue to monitor the system today for any residual problems.


UPDATE (7:35pm) - Slurm issues persist, and job submissions are not working.  Jobs that are already running will continue to run, but you may not get email notices when they complete or fail.  Our Systems Team will investigate the issue as soon as service hours resume in the morning.

We believe the bug is related to a configuration error: jobs submitted during a brief time window cause the scheduler to crash.  We know the approximate cause of the issue, but we are not sure of the exact time window, and we do not want to cancel every job that was submitted.

Thank you for your patience while we work through the issue.  Please direct any questions or concerns to support@rcc.fsu.edu.


UPDATE (2:40pm) - Slurm is operational.  The crash was related to jobs submitted during a brief time window when erroneous configuration values were set.  We are monitoring the controller to ensure that no further jobs crash it.  If you have trouble submitting a job during this period, wait a few seconds and then try again.
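
For users who submit jobs from scripts, the "wait a few seconds, and then try again" advice can be automated.  The following is a minimal sketch, not an official RCC tool: the helper name submit_with_retry, the attempt count, the delay, and the script name job.sh are all illustrative assumptions.  It only assumes that the submission command exits with a nonzero status when it fails.

```python
import subprocess
import time


def submit_with_retry(cmd, attempts=5, delay=5.0, sleep=time.sleep):
    """Run a submission command, retrying after a short pause on failure.

    cmd      -- command as a list, e.g. ["sbatch", "job.sh"] (job.sh is hypothetical)
    attempts -- maximum number of tries (assumed value, tune as needed)
    delay    -- seconds to wait between tries (assumed value)
    sleep    -- injectable wait function, which makes the helper easy to test
    """
    last = None
    for attempt in range(1, attempts + 1):
        last = subprocess.run(cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return last  # submission succeeded
        if attempt < attempts:
            sleep(delay)  # give the controller a moment before retrying
    raise RuntimeError(
        f"command failed after {attempts} attempts: {last.stderr.strip()}"
    )


# Example usage (requires Slurm's sbatch on the PATH; job.sh is a placeholder):
#   result = submit_with_retry(["sbatch", "job.sh"])
#   print(result.stdout.strip())
```

Keeping the number of attempts small avoids piling extra load on a controller that is still recovering.  If submissions keep failing, stop retrying and contact support@rcc.fsu.edu.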


We are seeing issues with our Slurm controller.  The Systems Team is investigating.