UPDATE - Tuesday, Nov 28 - 4:50pm - The HPC and the Spear clusters are back online. You may now submit jobs to the cluster.
A few nodes need additional work, but the majority of the cluster is up and running jobs. Thanks for your patience. We ran into several residual issues today, which unfortunately prolonged the downtime by several hours.
We now believe that the PanFS performance issues were caused by the ongoing maintenance we have been conducting on HPC nodes over the past few months. As we have moved more and more nodes from the old network to the new network, more traffic has had to be routed through the uplink on the old network switch. Eventually, the load became too high, and the old switch started dropping packets. This resulted in serious performance issues.
The solution, which we implemented today, was to move the storage system to the new network. We had planned on doing this anyway (closer to the holidays), but were forced to expedite it in order to solve the performance issues.
We'll keep an eye on things to make sure they are back to normal, and if you see any issues, please let us know (email@example.com).
UPDATE - Tuesday, Nov 28 - 3:30pm - Most HPC compute nodes (and Spear) are back online. You may now submit jobs to the cluster. Thanks very much for your patience.
We are still cleaning up a few residual issues, mainly on the end-of-life, general access compute nodes. We will send an email out at 4:30pm today with a final update.
UPDATE - Tuesday, Nov 28 - 2:15pm - HPC Login nodes are back online, but there are still residual clean-up tasks to be done. We will post another update no later than 4:30pm today. Thanks for your patience.
UPDATE - Tuesday, Nov 28 - 11:30am - We were able to complete the PanFS migration to a new network switch this morning. Unfortunately, it appears that all compute nodes, login nodes, and Spear nodes must be rebooted in order to connect to the storage server on the new network correctly.
We anticipate that this will take two hours or so. During this time, HPC, Spear, and Globus will be unavailable. We will post another notice (and send another email) as soon as everything is back online.
As you know, we've been experiencing ongoing HPC performance issues over the past few weeks. We have identified the source of the problem: a faulty network switch (see below).
To fix this issue, we need to perform emergency maintenance on our storage system (PanFS) and move it to a new network switch. We will perform this maintenance on Tuesday, November 28, from 9-11am. This will affect the entire HPC cluster.
In order to prepare, we have configured the cluster to stop scheduling all jobs starting immediately (today, Nov 22, at noon). If you submit a job to the cluster between now and Tuesday, it will receive a Job ID and be queued, but it will not start. Already-running jobs will continue to run (for now; see below).
We do not know how running jobs will be affected when we begin the maintenance on Tuesday. We advise users to "plan for the worst" and assume that any jobs running at 9am on Tuesday will fail and need to be resubmitted. The PanFS system will briefly disappear from the HPC, so at the very least, jobs will not have access to disk I/O for a few moments.
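For jobs that cannot easily be resubmitted, one defensive pattern is to retry a failed file operation a few times before giving up. The following is only a minimal sketch in Python, not a recommendation specific to PanFS: the function name retry_io and its parameters are hypothetical, and it assumes the storage interruption surfaces as an OSError. If I/O hangs instead of raising an error, this approach will not help, and the advice above still applies.

```python
import time


def retry_io(operation, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call operation(); on OSError, wait and try again.

    Hypothetical helper for illustration only. It assumes a brief storage
    outage shows up as an OSError raised by the file operation.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(delay_seconds * attempt)
    # Every attempt failed: re-raise the most recent error.
    raise last_error


# Example use (the file name is a placeholder):
#
#     def save_checkpoint():
#         with open("checkpoint.dat", "wb") as handle:
#             handle.write(b"...")
#
#     retry_io(save_checkpoint, attempts=5, delay_seconds=2.0)
```

With the values in the example, the helper would wait 2, 4, 6, and 8 seconds between its five attempts, so it only covers an interruption of well under a minute.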
If you have any questions, please let us know: firstname.lastname@example.org.
For the last few weeks, we have seen performance issues with our Panasas storage system. Although our initial focus was on troubleshooting the storage itself, we found an issue with one of our main network core switches, a 9-year-old Cisco 6509, which is dropping data packets.
Although we were already planning to move all our equipment to a newer core switch, we will now have to accelerate some of our plans.
In the coming weeks, we will connect our Panasas system to our new network core switch. This move will temporarily disconnect the storage from all of our equipment. Even though the outage may last no more than 15 minutes, we have no good way of predicting how any running software might respond.
Once we have worked out details and a time frame, we will post more updates.