libcrpm: Improving the Checkpoint Performance of NVM

Abstract

libcrpm is a new programming library to improve the checkpoint performance for applications running in NVM. It proposes the failure-atomic differential checkpointing protocol, which addresses two problems simultaneously that exist in the current NVM-based checkpoint-recovery libraries: (1) high write amplification when page-granularity incremental checkpointing is used, and (2) high persistence costs from excessive memory fence instructions when fine-grained undo-log or copy-on-write is used. Evaluation results show that libcrpm reduces the checkpoint overhead in realistic workloads. For MPI-based parallel applications such as LULESH, the checkpoint overhead of libcrpm is only 44.78% of FTI, an application-level checkpoint-recovery library.

Publication
59th Design Automation Conference