UPDATE: Power outage in Sliger Data Center from May 27 - June 2

UPDATE; Friday, June 4 @ 4:30pm:

Here is the last update before the weekend (hopefully 🤞).

 

On the HPC, all owner nodes and GPU nodes are online, and they are processing jobs submitted via Slurm. However, part of the cluster will remain offline through the weekend, including most of the general access nodes.

 

Jobs submitted through the backfill2 queue will run, but they may have to wait longer than usual before they start. We are asking for your patience over the weekend, and we will hopefully be able to bring more of the cluster up on Monday.

 

Chilled water cooling has been the hold-up for most of the week. The contractors have one of the two primary chillers online and fully tested. However, we do not believe they will have the other one online by the end of the day. We will continue to run temporary cooling along with the chiller over the weekend for some redundancy.

 

With only one of the two chillers running, there is no backup in case of a failure. The HPC nodes can heat up the room to near-dangerous levels very quickly, and we don't want to take any risks this weekend. We will, therefore, not bring the whole cluster up until we have assurance that both chillers are functional.

 

Refer to the image below to see some of the wild temperature swings in the data center over the past few days:

 

If the contractors can get the second cooling unit online by early next week, we will power up the full HPC as early as Monday. Either way, you can expect another message from the RCC Staff no later than 2pm on Monday.

 

Have a good weekend, and once again, thanks very much for your patience. As usual, comments and inquiries should be sent via email to support@rcc.fsu.edu.


UPDATE; Thursday, June 3 @ 11:30am:

 

All RCC systems are available except for the HPC. This includes the Parallel Filesystem (GPFS) and Archival storage systems, as well as customer VMs and other auxiliary services.
 
The cooling system in the Sliger Data Center failed again after running for approximately four hours yesterday. RCC staff noticed the temperatures starting to rise early yesterday evening (June 2).
 
We notified the vendors, and they are working on it. The good news is that they have identified the source of the problem, a faulty flow switch in the liquid cooling infrastructure.
 
We don't have a specific ETA yet for when we can bring the HPC back online today, especially since cooling failed twice before. Once we have confidence that the cooling issues have been completely resolved, we will post a message to this notice list.
 
In the meantime, RCC staff are taking the opportunity to shore up some infrastructure that we do have control over. In that regard, we have a faulty PDU in the rack that contains the Archival storage system. We may have to turn off the Archival system for about an hour to replace the faulty PDU. If we do that, we will send a notice to this list both when it goes down and when it comes back up.
 
Again, we appreciate your continued patience, and we will send an update sometime today once we have news.

UPDATE; Tuesday, June 1 @ 10pm:

We are continuing to bring up RCC services. Currently, our parallel storage system (GPFS) is online, but not yet available to users.
 
We will resume bringing up the rest of the systems starting at 6:30am tomorrow. You can expect to receive several updates throughout the day tomorrow until all our services are online. This includes:
  • The HPC cluster
  • The Spear interactive cluster
  • Parallel storage
  • Archival storage
  • Customer VMs and special servers
Thanks very much for your continued patience. We will be monitoring the support@rcc.fsu.edu email address, so if you have any concerns or issues, please let us know by sending an email to that address.

NOTE: Due to an upstream vendor issue, we've had to push the outage back approximately one month; this article has been updated to reflect the changed dates.

As part of the ongoing Sliger Renovation, the contractor (Arbitron-Williams) will be working on the data center's electrical and HVAC systems over the last weekend in May (Memorial Day). Everything hosted in the Sliger Data Center will have to be shut off before that happens, including all RCC systems.

Given the complexity of the task, we will need a full day before the actual outage begins to shut down all RCC systems and a full day to bring them back online after the outage is over. The outage will be for all RCC services, including, but not limited to, the following:

  • The High Performance Computing cluster
  • The Interactive Computing cluster (Spear)
  • GPFS and Archival storage
  • Virtual Machines running on the "SKY" cluster
  • The "vpn.fsu.edu/hpc" profile for the FSU VPN

The schedule will be as follows:

  • Thursday, May 27; 8am - RCC powers down all RCC systems, which will remain offline through Tuesday, June 1
  • Tuesday, June 1; noon - Power and HVAC restored and running
  • Tuesday, June 1; 6pm - Most RCC systems back online (we will send a notice about what's available and what's not)
  • Wednesday, June 2; 5pm - All RCC systems back online

These are the best estimates we can provide at this time, and they may change between now and when the maintenance occurs. The RCC will make every effort to communicate further schedule changes promptly through our website, newsletter, and systems notice list.

Co-Location Customers

If your department has equipment in the Sliger Data Center, the power outage will affect your systems as well. We have already reached out to most affected departments but want to make sure everyone is on the same page.

RCC staff will contact you in the coming weeks to determine your specific power needs and work out a schedule for either shutting off your systems or transferring them over to temporary power via a generator.

While temporary generator-supplied power is an option for mission-critical systems, we strongly urge you to shut off non-critical servers. There will be no power redundancy from UPS backups and no redundant cooling. Running systems on temporary power will be done at your own risk. If you choose to shut off your systems for the outage duration, we will provide protective coverings for your systems or racks to protect them from dust and other construction debris.

Whether you choose for your systems to remain on temporary power or not, a technician from your department will most likely need to be on site at the Sliger Data Center both before and after the maintenance.

The schedule for co-location customers is as follows:

  • Thursday, May 27; 8am - Temporary power generator online; RCC staff on-site to assist departments in powering off or transferring their systems onto the generator
  • Friday, May 28; 8am - Tuesday, June 1 - Downtime
  • Tuesday, June 1; noon - Power and HVAC restored and running; RCC staff on-site all afternoon to assist departments with switching equipment back to permanent power
  • Wednesday, June 2; 8am - 5pm - RCC staff on site all day to help departments transition remaining equipment back to permanent power

What's being done

This power outage is needed to complete necessary upgrades to power and cooling infrastructure as part of the Sliger Renovation project, which is scheduled to end in August 2021. The Sliger Building is decades old, and FSU has committed two million dollars to bring the infrastructure up to code.

While most of the improvements will be behind the scenes, there are a few notable upgrades; some highlights include:

  • Enhanced power and cooling infrastructure to support RCC systems and co-location customers
  • Dedicated switch and increased network speed delivered to racks
  • Replacement of original fire suppression system

For questions, please reach out to us

Contact us at support@rcc.fsu.edu.