Checkpoint restart is a vital component of long-running HPC applications. As systems grow in size and complexity, applications compute on larger data sets. Due to data movement bottlenecks, traditional checkpointing approaches that save a subset of the application’s state are becoming prohibitively expensive. Reducing the checkpoint size via lossy compression can improve checkpoint restart performance. In this talk, we investigate a methodology for selecting lossy compression error tolerances for checkpointing HPC applications based on numerical properties of partial differential equation (PDE) simulations, such as bounds on the truncation error. We explore the methodology on 1D model problems and two production-level applications: PlasComCM and Nek5000. We show that, with this methodology, the error introduced into application variables by restarting from a lossy compressed checkpoint is masked by the numerical error of the discretization. This increases the efficiency of checkpoint restart without affecting the overall accuracy of the simulation. Furthermore, the results show that the methodology is robust to the choice of lossy compressor.

Host: Information Science and Technology Institute (ISTI)
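The idea of tying the compression tolerance to the truncation error can be illustrated with a minimal sketch on a 1D model problem. This is not the speakers' implementation. It assumes a second-order central-difference discretization, uses the leading truncation-error term (h^2 / 12) * max|u''''| as the reference scale, and stands in for a real error-bounded compressor with simple uniform quantization. The function names, the test field sin(pi x), and the safety factor of 0.1 are all illustrative assumptions.

```python
import math


def truncation_error_bound(h, fourth_derivative_max):
    """Leading truncation-error term of the second-order central
    difference approximation to u'': (h^2 / 12) * max|u''''|."""
    return h * h / 12.0 * fourth_derivative_max


def quantize(values, tol):
    """Hypothetical stand-in for an error-bounded lossy compressor:
    uniform quantization whose absolute pointwise error is at most tol."""
    step = 2.0 * tol
    return [step * round(v / step) for v in values]


# 1D model field u(x) = sin(pi x) sampled on a uniform grid over [0, 1].
n = 101
h = 1.0 / (n - 1)
u = [math.sin(math.pi * i * h) for i in range(n)]

# For u = sin(pi x), max|u''''| = pi^4.
trunc = truncation_error_bound(h, math.pi ** 4)

# Choose the compression tolerance as a fraction of the truncation-error
# scale. The factor 0.1 is an assumption made for this sketch.
tol = 0.1 * trunc

# "Checkpoint" the state lossily, then measure the error a restart would see.
restored = quantize(u, tol)
max_compression_error = max(abs(a - b) for a, b in zip(u, restored))

print("truncation-error scale:", trunc)
print("compression tolerance: ", tol)
print("max compression error: ", max_compression_error)
```

In this sketch the error introduced by the lossy checkpoint stays below the chosen tolerance, which in turn sits below the truncation-error scale. A production setting would replace `quantize` with an actual error-bounded compressor and would need an error estimate appropriate to the application's discretization.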