RCC News

News Archive

HPC Cheat Sheet

We've published a handy HPC Cheat Sheet. Download and print it if you want a quick reference.

Scheduler Update: Memory Limits

Recently, we noticed a substantial number of nodes crashing, causing job failures. We have investigated and determined that the cause is memory exhaustion: jobs have been filling all available RAM and swap space. Under Moab and RHEL 6.5, this issue did not show up, because the Linux kernel killed the offending jobs. With the new scheduler, these jobs instead cause compute nodes to crash and reboot. Any job running on those nodes fails, and users receive no meaningful error message (only "node failure").

HPC Status Update

We've been tuning, tweaking, and fixing the HPC since we upgraded the system in July, and we have many updates to report.

We're Hiring (SysAdmin)!

The RCC is hiring a systems administrator to work on our team at FSU. If you're interested, you should apply!

Status Report on the HPC

Here are a few updates on the HPC, including the state of accounts, job preemption, and other items.

Slurm Scheduler Issues Resolved

UPDATE 7pm: The HPC issues are resolved. Thanks for your patience.

We are currently experiencing issues on the HPC where Slurm commands are not responding. Our Systems Team is working to restore the service, and we will post an update as soon as the issue is resolved.

OpenMPI: Major Memory Leak Bug

UPDATE: Fri, Jul 24 - 9:00pm - We have completed compiling and redeploying the new version of OpenMPI. All systems are now running OpenMPI v1.8.7.

We have just been notified of a major memory leak with OpenMPI v1.8.6 (the current version on the HPC). This is a likely reason that many nodes have been crashing and disrupting jobs on the HPC this week.

HPC Upgrade Complete

The HPC is back online, and the new Slurm scheduler is generally available to all users.

If you experience any issues submitting your jobs, or if you have questions about using Slurm, please let us know: support@rcc.fsu.edu.

You may also want to refer to our online materials listed in our prior announcement.

Upgraded Software

Our Applications Team has been busy upgrading, recompiling, and testing software on our systems. We are pleased to provide upgraded versions of 126 packages.

In addition, we are upgrading the High Performance Computing cluster to Red Hat Enterprise Linux 7.1, which includes a new Linux kernel and many other enhancements.

Here are the packages, in alphabetical order:

System Upgrades (including Slurm) to occur Monday, July 13

We have scheduled our Slurm and RHEL7 upgrade to run from Monday, July 13 through Sunday, July 19. During this time, our HPC and Spear systems will be unavailable.