HPC Upgrade

Greetings HPC Users,

We are planning the annual HPC maintenance period for this summer. This maintenance will occur in stages between July 25 and August 19. We will need to shut off the HPC, Spear, and other systems for either part or all of the maintenance period. All running HPC jobs will need to complete before the maintenance begins, or they will be terminated when we turn off the system.

Objectives

Some of the objectives of this maintenance period include:

  • Move our infrastructure from RedHat Enterprise Linux to CentOS
  • Add login nodes to our cluster
  • Upgrade Intel and PGI Compilers to latest versions
  • Upgrade Lustre
  • Upgrade the Slurm Scheduler to v15.08
  • Improve system deployment and configuration

Timeline

  • Monday, July 25 at 9am - We will turn off Lustre and Spear servers
  • Sunday, August 7 at 5pm - We will begin draining running jobs on the HPC
  • Monday, August 8 at 9am - We will shut down all HPC nodes, including login nodes, and begin the upgrade
  • August 8 through August 19 - We will upgrade all systems and notify users as they come back online
  • August 22 at 9am - Maintenance will be complete. All systems will be online.

Some systems will resume service before August 22. We will keep users notified via our website and our Twitter account (@fsurcc).
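For users planning long-running jobs, the following is a minimal Python sketch for checking whether a job would finish before the shutdown on Monday, August 8 at 9am. It is an illustration only, not an RCC tool: the function names are hypothetical, the year (2016) is an assumption because this announcement does not state it, and the sketch does not model the drain that begins on August 7, after which new jobs may not start.

```python
from datetime import datetime, timedelta

# Shutdown time taken from the timeline above.
# ASSUMPTION: the year (2016) is not stated in the announcement.
SHUTDOWN = datetime(2016, 8, 8, 9, 0)


def finishes_before_shutdown(start: datetime, walltime: timedelta) -> bool:
    """Return True if a job starting at `start` with the requested
    walltime would end at or before the shutdown time."""
    return start + walltime <= SHUTDOWN


def max_walltime(start: datetime) -> timedelta:
    """Return the longest walltime that still ends by the shutdown.
    Returns zero if `start` is already past the shutdown time."""
    return max(SHUTDOWN - start, timedelta(0))


if __name__ == "__main__":
    start = datetime(2016, 8, 5, 9, 0)  # hypothetical job start time
    print(finishes_before_shutdown(start, timedelta(days=2)))
    print(max_walltime(start))
```

A job that requests more walltime than max_walltime reports for its expected start time would still be running at the shutdown and would be terminated.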

In Depth

Move our infrastructure from RedHat Enterprise Linux to CentOS

We are moving all of our servers to CentOS across our entire infrastructure. This includes HPC, Spear, and all auxiliary/support systems.

Currently, FSU maintains an enterprise license to use the RedHat Linux operating system. Because of the number of servers we maintain, RCC uses the majority of this subscription. The CentOS operating system is an actively maintained, open-source clone of the RedHat OS. By moving to CentOS, we can both simplify our systems management workflow (no licensing steps involved) and save the University a significant recurring cost. RCC users should notice no difference between the two systems.

Add login nodes to our cluster

As part of the upgrade, we will be adding a fourth login node to the HPC Login Cluster. This will help alleviate resource constraints on our general-access login nodes. Typically, there are between three and four dozen users simultaneously logged into each login node. Adding more nodes will help ensure that all users are able to access the resources they need without significant slowdown.

Upgrade Intel and PGI Compilers to latest versions

We are upgrading the Intel compiler to version 16.04 and the PGI compiler to version 16.3. GCC compilers will remain at the current version (v4.8.3). These compilers are already available on the HPC login nodes, but the associated MPI modules will not be available until the upgrade is complete.

Upgrade Lustre

The current version of Lustre at the RCC is dated and needs upgrading. The vendor we originally contracted with is now out of business, and our support contract has been moved to Intel. In order to take advantage of the support agreement, and to improve system stability and performance, we are upgrading to the latest enterprise edition offered by Intel.

The Lustre upgrade will involve copying a significant amount of data onto new storage nodes, which is why we must disable the service.

Upgrade the Slurm Scheduler to v15.08

We are upgrading the HPC scheduler to the latest stable version of Slurm (v15.08). This version includes bug fixes, stability enhancements, and minor feature additions. Release notes are available at: http://slurm.schedmd.com/news.html

Improve system deployment and configuration

We are doing quite a few things on the back-end in order to streamline system configuration and deployment. With over 800 nodes to maintain and manage, any improvement in efficiency makes a big difference. Some activities include:

  • Simplify how user accounts are managed on systems. We are moving from a home-grown tool to an industry-standard, well-documented and supported tool.
  • Upgrade our InfiniBand drivers
  • Upgrade our provisioning/deployment tool from RedHat Satellite and Foreman to Katello.
  • Upgrade storage drivers

If you have any questions or concerns, please let us know: support@rcc.fsu.edu