Oct 10, 2022 Power Outage in Sliger

UPDATE 10/11 @ 10:15am:

Message from RCC Director, Paul van der Mark:

Dear Sliger colocation customers and RCC users,

Yesterday, we experienced an unplanned power outage in our Sliger data center. It was an unfortunate fluke of standard maintenance and an undocumented feature. During our renovation last spring, the contractor installed a new fire-suppression system. In addition, a new safety feature was added, connecting that system with the emergency power-off switch on our UPS. However, the contractor did not add that feature to the wiring diagrams, so when Orr Protection performed a routine test on the new system, it turned off the UPS and thereby turned off power in the server room.

RCC staff restored power to most of the colocation customers within 15 minutes. Unfortunately, many customers still had to come to the data center to reset or turn on their equipment. Because of its complexity, the RCC HPC system took several hours to become fully operational.

On the afternoon of Monday, October 10th, FSU's Department of Environmental Health and Safety put a permanent fix in place for the issue. We are therefore confident that this was a unique occurrence. We are genuinely sorry for any inconvenience this has caused.

Best Regards,

Paul van der Mark, PhD
Director, Research Computing Center
Information Technology Services
Florida State University
Phone: 850.644.0193
its.fsu.edu | rcc.fsu.edu


UPDATE 10/10 @ 3pm:

We are pleased to report that all RCC services are back online. Thanks for being patient while the Systems Team worked to bring everything back. The following services have been restored:
  • The HPC and Spear clusters are online, and the Slurm scheduler is accepting jobs.
  • The "/hpc" VPN profile for students, guests, and any other non-staff members is up.
  • Open OnDemand is up.
  • The self-service web portal and webservices (RCCTool) are up.
  • All RCC-managed customer VMs and other hosted systems are up.
  • Globus is up.
  • Our storage export servers are up.

If you had jobs running at the time of the outage this morning, you will need to resubmit them.

The power was out from approximately 9am to 9:30am. Because it was unplanned and unexpected, it took about five hours to bring all RCC services back online.
What caused the outage: The Orr Protection company performed a standard, periodic inspection of the fire suppression system. This was the first such inspection since the Sliger renovation finished last August. This time, the test triggered the Emergency Power Off (EPO) on the UPS. The connection between the fire suppression system and the EPO was established during the renovation but had not been documented.
The root cause has been identified and remedied. If you have any questions or notice anything that isn't working, please let us know: support@rcc.fsu.edu.

UPDATE 10/10 @ 12:45pm:

We are making progress restoring service.

  • The HPC and Spear clusters are online, and the Slurm scheduler is accepting jobs.
  • The "/hpc" VPN profile for students, guests, and any other non-staff members is still down.
  • Open OnDemand is down.
  • The self-service web portal and webservices (RCCTool) are down.
  • All RCC-managed customer VMs are down.
  • Globus is down.
  • Our export servers are down.

At approximately 9:05am this morning, the Sliger data center suffered a power outage. The power is back on, but all RCC systems were affected. We are working to bring everything up as quickly as possible. Colocation customers were affected as well.

More details to come. We will update this page throughout the day until everything is back online.

If you need a specific system addressed, please email support@rcc.fsu.edu.