Evolution of Monitoring over the Lifetime of a High Performance Computing Cluster

Adam DeConinck; Kathleen Kelly

doi:10.1109/CLUSTER.2015.123

Source

2015 IEEE International Conference on Cluster Computing, pp. 710-713

Abstract

High Performance Computing (HPC) systems typically have lifetimes of four to six years. During this lifetime, a system will undergo substantial changes in its system software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and its environment have changed substantially, resulting in reliability and stability issues on the cluster. We present our experiences with the standard monitoring and analysis tools available on Mustang since its installation, and describe how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.