Systemwide Power Management Targeting Early Hardware Over-provisioned High Performance Computers

Date and time: 
Friday, June 2, 2017 - 12:00
Location: 
220 Deschutes
Author(s):
Daniel Ellsworth
University of Oregon
Host/Committee: 
  • Allen Malony (Chair)
  • Martin Schulz
  • Hank Childs
  • Reza Rejaie
  • Doug Toomey

Abstract

High Performance Computing (HPC) is an important enabling technology for big science, supporting simulation of phenomena and exploration of data sets that would be intractable to complete manually. The computational power of an HPC system is measured by the number of floating point operations completed per second (FLOPS). Current generation HPC systems are capable of 10^15 FLOPS, and over the past 7 years there has been significant effort to design a practical HPC system capable of achieving 10^18 FLOPS. Designing a system able to provide 10^18 useable FLOPS without excessive energy costs is one of the major barriers to reaching this computational performance target. Hardware overprovisioning has been suggested as a technique to increase useable FLOPS without increasing power; however, a power scheduler is required to safely deliver the improved performance. The construction and performance characteristics of such power schedulers remain an open problem for the HPC community.

This dissertation makes significant contributions to the HPC power scheduling community. A formalism is presented that can be used to evaluate the safety of proposed power schedulers for deployment on hardware overprovisioned HPC systems. Analysis of the general effects of processor power capping is used to develop a novel power scheduling and simulation approach, avoiding the need for the extensive application modeling required by all other current power scheduling work. Power scheduler performance is evaluated through empirical studies on existing HPC hardware and through simulation.

This work shows that power scheduling can be done safely and effectively without application-specific models using a simple feedback mechanism. Additionally, the safety of any power scheduler for deployment can be proven by analyzing scheduler behavior and mechanism against a simple invariant. Further, this work shows that dynamic power scheduler performance can be analyzed efficiently using simulation, which avoids the cost challenges associated with work on real HPC clusters. Finally, the work suggests a way forward for power scheduling research that will increase comparability, and presents basic comparison results for power scheduling strategies.
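
As a rough illustration of what a feedback-driven power scheduler checked against an invariant might look like, the Python sketch below performs one rebalancing step over per-node power caps. It is not the dissertation's algorithm. The function names (feedback_step, invariant_holds), the parameters (min_cap, max_cap, margin), and the particular invariant (the sum of node caps never exceeds the system power bound) are all assumptions made for this example.

```python
from typing import Dict


def invariant_holds(caps: Dict[str, float], system_bound: float) -> bool:
    """Assumed safety invariant: total allocated power stays within the bound."""
    return sum(caps.values()) <= system_bound + 1e-9


def feedback_step(
    caps: Dict[str, float],
    usage: Dict[str, float],
    system_bound: float,
    min_cap: float = 50.0,    # hypothetical lowest allowed cap per node (watts)
    max_cap: float = 150.0,   # hypothetical highest allowed cap per node (watts)
    margin: float = 5.0,      # hypothetical headroom kept above measured usage
) -> Dict[str, float]:
    """One feedback step using only measured power, with no application model."""
    if not invariant_holds(caps, system_bound):
        raise ValueError("initial caps already violate the system power bound")

    new_caps: Dict[str, float] = {}
    hungry = []  # nodes drawing power close to their current cap
    for node, cap in caps.items():
        if usage[node] < cap - margin:
            # Node is under-using its allocation: lower its cap, never raise it here.
            new_caps[node] = min(cap, max(min_cap, usage[node] + margin))
        else:
            new_caps[node] = cap
            hungry.append(node)

    # Power freed by lowering caps (plus any previously unallocated power).
    pool = system_bound - sum(new_caps.values())
    if hungry and pool > 0:
        share = pool / len(hungry)
        for node in hungry:
            new_caps[node] = min(max_cap, new_caps[node] + share)

    # Caps were only lowered, then raised by at most the freed pool,
    # so the invariant is preserved by construction.
    assert invariant_holds(new_caps, system_bound)
    return new_caps


if __name__ == "__main__":
    caps = {"a": 100.0, "b": 100.0, "c": 100.0}
    usage = {"a": 60.0, "b": 99.0, "c": 100.0}
    print(feedback_step(caps, usage, system_bound=300.0))
```

In this sketch the safety argument is local to the step: caps are first only lowered, and the amount later handed out never exceeds the power freed, so a scheduler that starts within the bound stays within it.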