Conference proceeding
Evaluating Compiler IR-Level Selective Instruction Duplication with Realistic Hardware Errors
2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pp.41-49
11/2019
DOI: 10.1109/FTXS49593.2019.00010
Abstract
Hardware faults (i.e., soft errors) are projected to increase in modern HPC systems. These faults often propagate through programs and result in silent data corruptions (SDCs), seriously compromising system reliability. Selective instruction duplication, a widely used software-based error detection technique, has been shown to be effective in detecting SDCs with low performance overhead. In the past, researchers have relied on compiler intermediate representation (IR) for program reliability analysis and code transformation in selective instruction duplication. In doing so, they assumed that IR-based analysis and protection remain representative under realistic fault models (i.e., faults originating at lower hardware layers). Unfortunately, this assumption has not been fully validated, which raises questions about the accuracy and efficiency of the protection, since IR is a higher level of abstraction far removed from the hardware layers. In this paper, we verify the assumption by injecting realistic hardware faults into programs that are guided and protected by IR-based selective instruction duplication. We find that the protection yields high SDC coverage with low performance overhead even under realistic fault models, although a small number of such faults escape the detector. Our observations confirm that IR-based selective instruction duplication is a cost-effective method to protect programs from soft errors.
Details
- Title: Subtitle
- Evaluating Compiler IR-Level Selective Instruction Duplication with Realistic Hardware Errors
- Creators
- Chun-Kai Chang - The University of Texas at Austin; Guanpeng Li - University of British Columbia; Mattan Erez - The University of Texas at Austin
- Resource Type
- Conference proceeding
- Publication Details
- 2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pp.41-49
- DOI
- 10.1109/FTXS49593.2019.00010
- Publisher
- IEEE
- Language
- English
- Date published
- 11/2019
- Academic Unit
- Computer Science
- Record Identifier
- 9984259474202771