Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Institut für Informatik

Trial lecture for the doctorate: Masoud Gholami

Optimizing Checkpoint/Restart and Input/Output for Large Scale Applications

The talk will take place in the Humboldt-Kabinett; online participation via Zoom will also be possible.

Abstract

In exascale and high-performance computing (HPC), failures are not occasional but inherent: they occur routinely during the runtime of applications. Addressing them effectively is essential to the resilience and reliability of supercomputing operations. Checkpoint/Restart (C/R) is a technique used in HPC to improve job resilience in the case of failures. It involves periodically saving the state of a running application to disk, so that if the application fails, it can be restarted from the last checkpoint. However, checkpointing can be time-consuming and can significantly degrade application performance, particularly because of the I/O operations needed to write the application state to disk. Optimizing the C/R process is therefore crucial for reducing its impact on application performance and improving job resilience.
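The basic C/R cycle described above can be illustrated with a minimal Python sketch. It is a toy illustration, not the mechanism developed in this work: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state format, and the `fail_at` parameter used to simulate a failure are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    # Write to a temporary file, then rename atomically, so that a crash
    # during the write never leaves a half-written checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    # Start from scratch if no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, interval, fail_at=None):
    # Resume from the last checkpoint (hypothetical workload: sum of step indices).
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        # Checkpoint every `interval` steps.
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run fails between two checkpoints, calling `run` again with the same path repeats only the steps after the last saved state. The checkpoint interval trades the I/O cost of each write against the amount of work lost on failure.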

The first part of this work explores and develops novel techniques for C/R management in HPC. This includes introducing a novel C/R approach that combines the XOR and partner C/R mechanisms, developing a model for multilevel C/R in large computational resources, and optimizing the shared usage of burst buffers for C/R in supercomputers.
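As background for the XOR mechanism mentioned above, the following minimal Python sketch shows the generic idea of XOR-based checkpoint protection: a parity block is computed over the checkpoints of a group of nodes, and a single lost checkpoint can be rebuilt from the survivors and the parity. It does not represent the combined XOR/partner approach of this work; the function names and the sample data are assumptions, and all checkpoints are assumed to have equal size.

```python
from functools import reduce


def xor_blocks(blocks):
    # Byte-wise XOR of equally sized byte strings.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(checkpoints):
    # Parity block over the checkpoints of all nodes in a group.
    return xor_blocks(checkpoints)


def recover(surviving, parity):
    # XOR of the surviving checkpoints and the parity yields the lost one.
    return xor_blocks(surviving + [parity])


# Hypothetical checkpoints of three nodes (equal length).
checkpoints = [b"node0-st", b"node1-st", b"node2-st"]
parity = make_parity(checkpoints)

# Node 1 fails; rebuild its checkpoint from the other two and the parity.
restored = recover([checkpoints[0], checkpoints[2]], parity)
assert restored == checkpoints[1]
```

By contrast, a partner scheme keeps a full copy of each checkpoint on another node, which costs more storage than a single parity block but makes recovery a simple copy.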

C/R procedures generate substantial I/O, which becomes a bottleneck for HPC applications, so optimizing I/O is imperative. To optimize the C/R process, it is also important to understand the I/O behavior of an application, including how much data needs to be written, how frequently checkpoints should be taken, and where to store the checkpoints to minimize I/O bottlenecks. In the second part, we therefore investigate and introduce innovative techniques for I/O modeling and management. This includes developing a plugin for the GNU Compiler Collection (GCC) that selects the optimal storage device for an application's I/O based on its behavior, as declared through pragma annotations, and developing a model to estimate the I/O cost of applications under Linux that accounts for page management and process throttling.
