We have completed restoration of the Lustre filesystem, and the system is now operational. Nearly all data was recovered during the restoration. The copy of data from our backup took much longer than we anticipated, but it completed without error.
A very small number of files on the system that were written between August 29 and September 1 may be corrupt or unrecoverable. If you encounter such a file, the best thing to do is to delete it.
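If you would like to review which of your files fall in that window before deciding what to delete, the following is a minimal Python sketch that lists files by modification time. It rests on assumptions that are ours and not confirmed above: that the modification time reflects when a file was written, that the year is 2016, and that the window runs from the start of August 29 through the end of September 1 in local time. The function name and the example path are hypothetical. The script only lists candidates; it does not check whether a file is actually corrupt, and it deletes nothing.

```python
import os
from datetime import datetime


def files_modified_between(root, start, end):
    """Yield paths under root whose modification time is in [start, end)."""
    start_ts, end_ts = start.timestamp(), end.timestamp()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that a broken symlink does not raise
                mtime = os.lstat(path).st_mtime
            except OSError:
                # Skip entries that cannot be read at all
                continue
            if start_ts <= mtime < end_ts:
                yield path


# Example use (hypothetical path; the year 2016 is an assumption):
#   window_start = datetime(2016, 8, 29)
#   window_end = datetime(2016, 9, 2)   # exclusive, so Sept 1 is included
#   for p in files_modified_between("/path/to/your/lustre/dir", window_start, window_end):
#       print(p)
```

Files returned by this listing are only candidates for a closer look; most of them are likely to be intact.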
Over the next few weeks, we plan to double the number of physical nodes in the Lustre system. This is the last step in the upgrade that we began in July, which was held up by the Hermine recovery effort. It will bring our Lustre system back up to a total of 4 OSTs (we have been running on 2 since the Lustre software upgrade completed in late August). Total storage will go from 168 TB (currently) to 336 TB (full capacity).
On the HPC, we reconfigured all owner-based partitions that were affected by the loss of the 2015/2016 compute nodes to run on alternative nodes. We are now working on replacing the nodes that were damaged in the storm. This will take some time, since we have to obtain repair quotes and coordinate with our vendor. Generally speaking, all HPC services are up and running.
Another service that has been down recently is our VM Cluster ("SKY"). Our Systems Team met last week and developed a plan to move all customer VMs onto a more robust set of hardware. The new hardware configuration consists of an enterprise-grade storage array and more powerful compute nodes. It also provides additional redundancy in case of disk or software failure. This upgrade will take place over the next several weeks and will be fully transparent; your systems will not incur significant downtime, nor will any IP addresses change.
We will provide updates as we move forward with these projects. In the meantime, please let us know if you have any questions or issues: firstname.lastname@example.org.