UPDATE; Friday, June 4 @ 4:30pm:
Here is the last update before the weekend (hopefully 🤞).
On the HPC, all owner nodes and GPU nodes are online, and they are processing jobs submitted via Slurm. However, part of the cluster will remain offline through the weekend, including most of the general access nodes.
Jobs submitted through the backfill2 queue will run, but they may have to wait longer than usual before they start. We are asking for your patience over the weekend, and we will hopefully be able to bring more of the cluster up on Monday.
Chilled water cooling has been the hold-up for most of the week. The contractors have one of the two primary chillers online and fully tested. However, we do not believe they will have the other one online by the end of the day. We will continue to run temporary cooling along with the chiller over the weekend for some redundancy.
With only one of the two chillers running, there is no backup in case of a failure. The HPC nodes can heat the room to near-dangerous levels very quickly, and we don't want to take any risks this weekend. We will, therefore, not bring the whole cluster up until we are assured that both chillers are functional.
Refer to the image below to see some of the wild temperature swings in the data center over the past few days:
If the contractors can get the second cooling unit online by early next week, we will power up the full HPC as early as Monday. Either way, you can expect another message from the RCC Staff no later than 2pm on Monday.
Have a good weekend, and once again, thanks very much for your patience. As usual, comments and inquiries should be sent via email to email@example.com.
UPDATE; Thursday, June 3 @ 11:30am:
UPDATE; Tuesday, June 1 @ 10pm:
- The HPC cluster
- The Spear interactive cluster
- Parallel storage
- Archival storage
- Customer VMs and special servers
NOTE: Due to an upstream vendor issue, we've had to push the outage back approximately one month; this article has been updated to reflect the changed dates.
As part of the ongoing Sliger Renovation, the contractor (Arbitron-Williams) will be working on the data center's electrical and HVAC systems over the last weekend in May (Memorial Day). Everything hosted in the Sliger Data Center will have to be shut off before that happens, including all RCC systems.
Given the complexity of the task, we will need a full day before the actual outage begins to shut down all RCC systems and a full day after the outage ends to bring them back online. The outage will affect all RCC services, including, but not limited to, the following:
- The High Performance Computing cluster
- The Interactive Computing cluster (Spear)
- GPFS and Archival storage
- Virtual Machines running on the "SKY" cluster
- The "vpn.fsu.edu/hpc" profile for the FSU VPN
The schedule will be as follows:
- Thursday, May 27; 8am - RCC powers down all RCC systems; they will remain offline through Tuesday, June 1
- Tuesday, June 1; noon - Power and HVAC restored and running
- Tuesday, June 1; 6pm - Most RCC systems back online (we will send a notice about what's available and what's not)
- Wednesday, June 2; 5pm - All RCC systems back online
These are the best estimates we can provide at this time, and they may change between now and when the maintenance occurs. The RCC will make every effort to communicate further schedule changes promptly through our website, newsletter, and systems notice list.
If your department has equipment in the Sliger Data Center, the power outage will affect your systems as well. We have already reached out to most affected departments but want to make sure everyone is on the same page.
RCC staff will contact you in the coming weeks to determine your specific power needs and work out a schedule for either shutting off your systems or transferring them over to temporary power via a generator.
While temporary generator-supplied power is an option for mission-critical systems, we strongly urge you to shut off non-critical servers. There will be no UPS backup to provide power redundancy and no redundant cooling, so running systems on temporary power will be at your own risk. If you choose to shut off your systems for the duration of the outage, we will provide protective coverings for your systems or racks to protect them from dust and other construction debris.
Whether you choose for your systems to remain on temporary power or not, a technician from your department will most likely need to be on site at the Sliger data center both before and after the maintenance.
The schedule for co-location customers is as follows:
- Thursday, May 27; 8am - Temporary power generator online; RCC staff on-site to assist departments in powering off or transferring their systems onto the generator
- Friday, May 28; 8am - Tuesday, June 1 - Downtime
- Tuesday, June 1; noon - Power and HVAC restored and running; RCC staff on-site all afternoon to assist departments with switching equipment back to permanent power
- Wednesday, June 2; 8am - 5pm - RCC staff on-site all day to help departments transition remaining equipment back to permanent power
What's being done
This power outage is needed to complete upgrades to the power and cooling infrastructure as part of the Sliger Renovation project, which is scheduled to end in August 2021. The Sliger Building is decades old, and FSU has committed two million dollars to bring the infrastructure up to code.
While most of the improvements will be behind the scenes, there are a few notable upgrades; some highlights include:
- Enhanced power and cooling infrastructure to support RCC systems and co-location customers
- Dedicated switch and increased network speed delivered to racks
- Replacement of original fire suppression system
For questions, please contact us at firstname.lastname@example.org.