As of October 1, the RCC has almost fully recovered from the datacenter damage inflicted by Hurricane Hermine. The one remaining major item is the replacement of the 2015 HPC compute nodes. We have submitted a quote for replacement systems and are now waiting for the order to be processed and the systems to be delivered. We will let users know when they arrive.
On Friday morning, September 2, Hurricane Hermine hit Tallahassee and caused the chilled water system on campus to fail. This system supplies chilled water for cooling to our datacenter on the fourth floor of the Dirac building. The building generator did not fail, so all of the systems in that facility continued to run and generate heat. Temperatures in the datacenter quickly rose to over 100 °F. After a short time, the network switches that provide connectivity to the building failed due to heat. This limited our ability to assess and respond to the situation until after the storm.
On Friday, after the storm, RCC staff discovered that the excessive heat from the datacenter had triggered the building sprinkler system, which flooded the fourth floor of the building and much of the third floor. We were able to enter the facility on Friday afternoon and begin shutting off any systems that were still running. Our assessment showed that only one sprinkler head had engaged in the datacenter room. Unfortunately, it was directly over the recently purchased 2015 HPC compute nodes.
Before we could power on and assess any systems in the facility, we had to restore climate control and bring the temperature and humidity to acceptable levels. Since the sprinkler system had engaged and the air conditioning had been off for an extended period, there was a high probability that water droplets had formed on circuitry, which would cause fatal damage to systems if they were powered on. We worked with FSU Facilities and our cooling support vendor to repair damaged air units, which took until Monday morning to complete.
On Monday morning (Labor Day), we began to assess systems and move critical nodes to our second datacenter in the Sliger building. Fortunately, most of our servers survived the storm and the heat. We lost several network switches, which were quickly replaced, as well as a number of hard drives and other components.
The two major casualties of the storm were the Lustre system and the 2015 HPC compute nodes. One of the storage servers in Lustre had failed prior to the storm, but due to a misconfiguration, we did not discover the issue until the storm necessitated a reboot. We subsequently had to isolate that node and restore data from backup, which took several weeks. The 2015 HPC nodes were damaged when the sprinkler system engaged. Our assessment is that 76 nodes need to be completely replaced. Together with Vice President Kyle Clark and Provost Sally McRorie, we are working on appropriating funds and procuring replacement systems.
Plans for the Future
RCC wants to provide as reliable a level of service as possible with the resources that we have. Hermine exposed several weak points in our datacenter management, which we are working to correct. To that end, we are developing a disaster readiness procedure for future major weather events. We are also improving our temperature management and monitoring. We are working with our parent department, ITS, and with the Department of Scientific Computing to explore consolidating our datacenters and eliminating redundancies in management infrastructure.
We recently held our annual Advisory Board meeting. One concern that came up was communication during the incident and during recovery efforts. We will certainly make this a key focus going forward.