Conference proceeding
Experience report: An application-specific checkpointing technique for minimizing checkpoint corruption
2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), pp.141-152
11/2015
DOI: 10.1109/ISSRE.2015.7381808
Abstract
Checkpointing is widely deployed in computer systems to recover from failures due to both hardware and software errors. However, as faults propagate, checkpoints may become corrupted by saving erroneous states, making errors unrecoverable, especially at aggressive checkpoint frequencies. In this paper, we propose a technique that automatically analyzes a given program to guide checkpoint strategies in order to minimize checkpoint corruptions. To understand checkpoint corruptions, we first perform a large-scale fault injection study across ten benchmark applications. We then classify checkpoint corruptions and comprehensively characterize the fault propagations leading to them. Leveraging these findings, we build ReCov, a compiler-based tool that automatically identifies the program locations with the lowest density of fault propagation for placing checkpoints, and combines this placement with low-overhead protection techniques. Our experimental results show that ReCov can eliminate nearly 92% of the checkpoint corruptions with about 5% performance overhead. ReCov reduces the unavailability of the system by a factor of 8.25 even at very aggressive checkpoint frequencies, showing that it is effective in practice.
Details
- Title: Subtitle
- Experience report: An application-specific checkpointing technique for minimizing checkpoint corruption
- Creators
- Guanpeng Li - University of British Columbia
- Karthik Pattabiraman - University of British Columbia
- Chen-Yong Cher - IBM (United States)
- Pradip Bose - IBM (United States)
- Resource Type
- Conference proceeding
- Publication Details
- 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), pp.141-152
- DOI
- 10.1109/ISSRE.2015.7381808
- Publisher
- IEEE
- Language
- English
- Date published
- 11/2015
- Academic Unit
- Computer Science
- Record Identifier
- 9984259418302771