GRADUATE SCHOOL OF SCIENCES & ENGINEERING
COMPUTER SCIENCE AND ENGINEERING
PhD THESIS DEFENSE BY MUHAMMAD NUFAIL FAROOQI
Title: Asynchronous Runtime for AMR Applications on Exascale Systems
Speaker: Muhammad Nufail Farooqi
Time: September 06, 2018, 11:00
Place: ENG 208
Rumeli Feneri Yolu
Thesis Committee Members:
Asst. Prof. Dr. Didem Unat (Advisor, Koç University)
Prof. Dr. Metin Muradoğlu (Koç University)
Prof. Dr. Can Özturan (Boğaziçi University)
Asst. Prof. Dr. Alptekin Küpçü (Koç University)
Asst. Prof. Dr. Kamer Kaya (Sabancı University)
Adaptive Mesh Refinement (AMR) is an approach to solving partial differential equations that reduces the computational and memory cost for certain applications. AMR represents the computational domain as a hierarchy of successively refined grids, introducing both intra- and inter-level communication in the hierarchy. A straightforward AMR algorithm typically exhibits many synchronization points even within a single time step, where costly communication often degrades performance. This problem will be even more pronounced on exascale supercomputers with billion-way parallelism, which will raise the communication cost further. One common technique to reduce communication overhead is to overlap communication with computation through asynchronous execution. However, redesigning AMR algorithms to avoid synchronization and manually restructuring an AMR application to realize communication overlap are extremely complicated and increase software development and maintenance costs due to the large code size and complex control structures.
We present a nonintrusive asynchronous approach to hide the effects of communication in an AMR application. Specifically, our approach reasons about data dependencies automatically using domain knowledge about AMR applications, allowing asynchrony to be discovered with only a modest amount of code modification. Using this approach, we propose a phase asynchronous AMR algorithm and incorporate it into a widely used AMR framework, AMReX. Unlike a synchronous algorithm, where computation is delayed until all communication is finished, the phase asynchronous algorithm schedules a subgrid at an AMR level for computation as soon as the communication it depends on is completed. However, computation of all subgrids at an AMR level is completed before computation starts on the next AMR level. Our phase asynchronous AMR algorithm aims to achieve the performance of a fully asynchronous execution while retaining the productivity of a synchronous approach.
We implemented a runtime system to facilitate the proposed phase asynchronous execution for AMR applications. The runtime enables overlapping of both intra- and inter-level communication with the computation within a level. The runtime includes communication handlers that perform both intra- and inter-node communication. Applications hand over the communication data to the runtime's communication handlers. As soon as the communication dependencies of a subgrid are satisfied, the runtime schedules that subgrid for computation.
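The scheduling rule described above can be illustrated with a minimal Python sketch. It is only a toy model written for this announcement, not the AMReX runtime or its API: the names `run_level` and `run_hierarchy`, the message identifiers, and the assumption that message arrivals are known as an ordered list are all hypothetical.

```python
def run_level(deps, arrivals):
    """Return the order in which subgrids of one AMR level are computed.

    deps     -- dict: subgrid name -> set of message ids it waits for (hypothetical)
    arrivals -- message ids in the order their communication completes
    """
    pending = {grid: set(msgs) for grid, msgs in deps.items()}
    order = []

    # Subgrids with no outstanding communication can be computed immediately.
    for grid in list(pending):
        if not pending[grid]:
            order.append(grid)
            del pending[grid]

    # Each completed message may release one or more waiting subgrids.
    for msg in arrivals:
        for grid in list(pending):
            pending[grid].discard(msg)
            if not pending[grid]:
                order.append(grid)
                del pending[grid]

    if pending:
        raise RuntimeError(f"subgrids never became ready: {sorted(pending)}")
    return order


def run_hierarchy(levels):
    """Process levels one after another (the 'phase' part of the scheme).

    levels -- list of (deps, arrivals) pairs, coarsest level first
    """
    order = []
    for level_id, (deps, arrivals) in enumerate(levels):
        # All subgrids of this level finish before the next level starts.
        for grid in run_level(deps, arrivals):
            order.append((level_id, grid))
    return order


# Toy hierarchy: level 0 has three subgrids, level 1 has one.
levels = [
    ({"a": {"m1", "m2"}, "b": {"m2"}, "c": set()}, ["m2", "m1"]),
    ({"d": {"m3"}}, ["m3"]),
]
order = run_hierarchy(levels)
```

In this toy run, subgrid `c` is computed without waiting, `b` is released when `m2` arrives, and `a` only after `m1`, while level 1 does not start until level 0 is done. A real runtime would compute the released subgrids concurrently with the remaining communication rather than recording an order.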
We carefully designed an API for the phase asynchronous execution model to facilitate the transition of large legacy codes from synchronous to asynchronous execution. As a case study, we use CASTRO, a real-world code that solves multicomponent compressible hydrodynamic equations for astrophysical flows, to demonstrate our new asynchronous AMReX framework. We discuss the transformation strategy for CASTRO and evaluate the performance and programming effort required to use our new API. Our approach meets the performance requirements while retaining readability and long-term maintainability without severely affecting the productivity of the application programmer.
Supercomputer architectures are trending towards heterogeneity. At the moment, 110 of the top 500 supercomputers use accelerators, including five of the top 10. Thus, programming models without support for heterogeneous architectures are not likely to be adopted on these machines. We extend our runtime to support execution on heterogeneous architectures. We present how our programming model and runtime system deal with the challenges that arise when porting AMR applications to heterogeneous architectures. The runtime schedules computation on both CPUs and GPUs concurrently to use all computational resources effectively while maintaining productivity.
For performance studies, we use a small heat advection application and a large real-world production code, CASTRO. We demonstrate performance on the Hazel Hen and Cori supercomputers, on Intel Haswell cores and on Intel Xeon Phi (Knights Landing, KNL) processors. With reasonable programming effort, we were able to achieve performance improvements of up to 50% using 49K cores on Intel Haswell and 278,528 cores on KNL. We also carried out a performance analysis of the heterogeneous architecture support in the runtime using the heat advection code on SummitDev, a heterogeneous machine equipped with IBM Power8 CPUs and NVIDIA Tesla P100 GPUs.