- Joseph Sventek
We are entering a golden era for high performance computing (HPC). Parallel computation, analysis, movement, and storage of large amounts of data are necessary functions across a wide array of computational domains, including scientific computing, graph processing, and machine learning. To reap the benefits of large-scale HPC systems, these applications must often globally synchronize state across many parallel processing elements. The focus of this talk is performance variability, whereby parallel processing elements make non-uniform progress, generating "stragglers" that slow synchronization and thus the progress of applications.
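As a rough illustration of the straggler effect (this sketch is not from the talk): when workers synchronize at a global barrier, each step costs as much as the slowest worker, so the mean time per step grows with the number of workers even if each one is only occasionally delayed. The 1.0 base compute time and exponential noise delay below are illustrative assumptions.

```python
import random

random.seed(0)

def barrier_time(num_workers, iters=1000):
    """Mean time per globally synchronized iteration: every worker must
    reach the barrier, so each step costs the maximum of all workers'
    times (1.0 unit of compute plus a random exponential noise delay)."""
    total = 0.0
    for _ in range(iters):
        total += max(1.0 + random.expovariate(10.0)
                     for _ in range(num_workers))
    return total / iters

for n in (1, 16, 256, 4096):
    print(f"{n:5d} workers: {barrier_time(n):.3f} time units/iteration")
```

Even though the average per-worker delay is small, the maximum over thousands of workers is not, which is why a handful of stragglers can dominate the progress of a tightly synchronized application.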
This talk will first focus on the operating systems (OSes) in large-scale HPC systems, and will present research in specialized, multi-stack OSes that eliminate variability imposed by conventional general-purpose OSes (e.g., Linux). The talk will then discuss challenges that variability poses for emerging HPC architectures, and will motivate the need for "variability-tolerant" parallelism. The talk will present key questions related to the design of variability tolerance, and will motivate the use of stochastic and distributed optimization techniques to design dynamic, variability-tolerant system software.
Brian Kocoloski is a Ph.D. candidate in the Department of Computer Science at the University of Pittsburgh. He received his B.S. in Computer Science from the University of Dayton in 2011. He spent the summer of 2013 as an intern in the Scalable System Software group at Sandia National Laboratories, and the summer of 2015 as an intern at AMD Research.
The theme of his research is making it easier to efficiently utilize large parallel computers. He has designed operating systems and virtualization techniques to provide specialized, low-overhead environments for tightly synchronized parallel applications. His work is currently being leveraged in Hobbes, a US Department of Energy operating system for future exascale computers. He is also interested in distributed optimization techniques, particularly as they pertain to parallel runtimes in large-scale systems.