NOTE: This article was updated on August 5 to reflect new, more precise information about the upgrade schedule.
We are planning an upgrade to the HPC software stack during the week beginning August 16. Unlike previous software upgrades, we will perform a rolling upgrade to minimize downtime. This means that parts of the cluster will remain online while other parts are being upgraded.
The maintenance is planned for August 16 - 20. We will send out an email and update this webpage once a detailed draft schedule is ready (estimated 2-3 weeks).
We are making the following major changes to our software stack to increase stability, fix bugs, and enhance the usability of the HPC:
- Upgrade CentOS from v7 to v8.3
- Upgrade Slurm from v20.02 to v20.11 (release notes)
- Upgrade software packages to newer versions (partial list)
- Upgrade our Open OnDemand web portal from v1.7 to v2.0 (release notes)
- Make a major change in how we install new user software packages, such as NetCDF and R (details below)
Change to the upgrade process
The process for upgrading the cluster will require us to re-install the operating system on every compute node. To facilitate minimal downtime during the week of the upgrade, we will reinstall only small sets of nodes at a time, while keeping most nodes online. Once nodes have been reinstalled, they will immediately be put back into production and the next set of nodes will be reinstalled.
Since we have to update every node, running jobs will be killed periodically throughout the week. This also means that for a short time, the cluster will consist of two sets of nodes running two different software stacks: the old CentOS 7 build and the new CentOS 8 build. Users with more complicated scripts and workflows may want to wait until the upgrade is complete before submitting jobs, especially if their jobs use multiple queues/partitions.
The login nodes will be available, and the Slurm job scheduler will accept jobs throughout the upgrade. Our storage systems, GPFS and Archival, will also remain accessible throughout the upgrade process.
Given all of this, we recommend the following to users:
- Do not submit any important multi-day jobs during the week of August 16 - 20 if you can avoid it.
- Be prepared for partial outages and cancelled or interrupted jobs as we reinstall the operating system on the nodes.
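For workflows that must run while both software stacks are in service, one option is to have a script check which operating system release the current node is running before it picks binaries or paths. The following is a minimal sketch, assuming the nodes expose the standard /etc/os-release file with a VERSION_ID field; the function names are illustrative and not part of any RCC tooling.

```python
from pathlib import Path


def parse_os_release(text: str) -> dict:
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def major_version(fields: dict) -> str:
    """Return the major OS version, e.g. '8' for VERSION_ID="8.3"."""
    return fields.get("VERSION_ID", "").split(".")[0]


if __name__ == "__main__":
    os_release = Path("/etc/os-release")
    if os_release.exists():
        info = parse_os_release(os_release.read_text())
        version = major_version(info) or "unknown"
        print(f"{info.get('ID', 'unknown')} major version {version}")
```

A job script could call this at startup and branch on the result, or exit early if it lands on a node with an unexpected release.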
Changes to the HPC software stack
We are reorganizing our software libraries under a standard scheme. We already support multiple versions of some software on the HPC (e.g. R and MATLAB), but the folder structure has not been consistent, which can lead to confusion.
We encourage users to rely on environment modules when possible, which RCC staff keep up-to-date with the correct paths for libraries. If your code requires loading libraries for compilation, we encourage you to use the pkg-config tool instead of specifying the full path in your Makefiles or scripts. Refer to the pkg-config documentation on our website for more details. pkg-config is available on the HPC now, so you can refactor your custom scripts right away.
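As an illustration of the pkg-config approach, the sketch below asks pkg-config for a package's compile and link flags and assembles a compiler command from them, instead of hard-coding library paths. It is a sketch under stated assumptions: the package name "netcdf", the source file "read_data.c", and the gcc compiler are hypothetical examples, and it assumes the relevant environment module has already been loaded so that pkg-config can find the package.

```python
import shlex
import subprocess


def pkg_config_flags(package: str) -> list:
    """Ask pkg-config for the compile and link flags of one package."""
    result = subprocess.run(
        ["pkg-config", "--cflags", "--libs", package],
        capture_output=True, text=True, check=True,
    )
    return shlex.split(result.stdout)


def compile_command(source: str, output: str, flags: list,
                    compiler: str = "gcc") -> list:
    """Assemble a compiler invocation from flags reported by pkg-config."""
    # Flags go after the source file so that libraries are linked in order.
    return [compiler, source, "-o", output, *flags]


if __name__ == "__main__":
    # "netcdf" and "read_data.c" are hypothetical example names.
    try:
        flags = pkg_config_flags("netcdf")
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"pkg-config lookup failed: {err}")
    else:
        print(shlex.join(compile_command("read_data.c", "read_data", flags)))
```

Because the flags are looked up at build time, the same script should keep working if library install locations change during the upgrade.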
The upgrade schedule is listed below. While we will make every effort to adhere to it, it is subject to change as issues arise.
- Aug 9 - 6-8pm - Upgrade HPC controller
- Aug 16 - 8am - Begin upgrade:
  - Upgrade login nodes (two login nodes will be available at any given time)
  - Upgrade Spear nodes (two Spear nodes will be available at any given time)
  - Begin upgrade of free, general access HPC compute nodes
- Aug 17 through 20 - Continue upgrading HPC nodes:
  - Complete all free, general access nodes
  - Complete owner nodes in M31 & M32 racks
  - Complete owner nodes in M35 & M36 racks
  - Complete owner GPU nodes
- Aug 23 - All systems upgraded
Details on queues owned and managed by research groups
We plan to upgrade the owner nodes half at a time, so that at least 50% of them remain online at any given time. The general procedure for each set of nodes will be:
- Set the node to "drain" state. In this state, the node will not accept new jobs and will try to complete as many currently running jobs as possible.
- Wait approximately 24 hours.
- Reinstall the operating system and software on the node.
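The half-at-a-time procedure above can be sketched as follows. This only illustrates the idea and is not the tooling RCC staff actually use: the node names are hypothetical, and the command is built but never executed. It relies on Slurm's `scontrol update` command with the `State=DRAIN` and `Reason=` options.

```python
def split_in_halves(nodes: list) -> tuple:
    """Split a node list into two batches so one batch can stay online."""
    middle = len(nodes) // 2
    return nodes[:middle], nodes[middle:]


def drain_command(nodes: list, reason: str = "OS upgrade") -> list:
    """Build the scontrol call that puts the given nodes into drain state."""
    return ["scontrol", "update", f"NodeName={','.join(nodes)}",
            "State=DRAIN", f"Reason={reason}"]


if __name__ == "__main__":
    # Hypothetical owner node names, used only for illustration.
    owner_nodes = ["node-a1", "node-a2", "node-a3", "node-a4"]
    for number, batch in enumerate(split_in_halves(owner_nodes), start=1):
        print(f"Batch {number}:")
        print("  1. drain:", " ".join(drain_command(batch)))
        print("  2. wait approximately 24 hours for running jobs to finish")
        print("  3. reinstall the operating system and software")
```

With an odd number of nodes the two batches cannot be equal, so the split would need adjusting to hold to the 50% guarantee.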
If you have any questions or comments, please reach out to us: firstname.lastname@example.org.