Performance Observability and Monitoring of High Performance Computing With Microservices

Srinivasan Ramesh
Date and time: Tue, May 24 2022 - 3:00pm
220 Deschutes and via Zoom
University of Oregon
  • Allen Malony (Chair)
  • Hank Childs
  • Boyana Norris
  • Dare Baldwin (Psychology)
  • Dr. Robert B. Ross (Argonne National Laboratory)

Traditionally, High Performance Computing (HPC) software has been built and deployed as bulk-synchronous, parallel executables based on the Message Passing Interface (MPI) programming model. The rise of data-oriented computing paradigms and an explosion in the variety of applications that need to be supported on HPC platforms have forced a rethink of the appropriate programming and execution models to integrate this new functionality. Service-oriented architectures and a broader class of in situ workflows mark a paradigm shift in HPC software development methodologies, one that has enabled a range of new applications, from user-level data services to machine learning (ML) workflows that run alongside traditional scientific simulations.

This dissertation proposes techniques and accompanying tools to enable performance observability and monitoring of in situ HPC workflows that involve distributed services. Conversely, the dissertation also demonstrates the value of deploying performance monitoring and visualization as shared services within an in situ workflow. The results from this dissertation suggest that: (1) appropriate integration of performance data from different sources is vital to understanding the performance of service components, (2) in situ (online) analysis of this performance data is needed to enable adaptivity of distributed components, and (3) services are a promising architectural choice for deploying in situ performance monitoring and visualization functionality.