Conference proceeding
Versatile Datapath Soft Error Detection on the Cheap for HPC Applications
Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp.1-15
ACM Conferences
SC '24: The International Conference for High Performance Computing, Networking, Storage, and Analysis
11/17/2024
DOI: 10.1109/SC41406.2024.00061
Abstract
With the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors that corrupt datapath units during program execution. While these errors have received considerable attention recently, existing solutions either confine themselves to limited scopes or incur massive overheads in performance and power consumption, hindering practical usage. In this work, we propose CONDA, a novel error detection technique based on code transformation and static program analysis that achieves versatile datapath protection at low cost. At compile time, CONDA analyzes program characteristics and transforms the original program code without complicating its control-flow and memory access patterns. At runtime, CONDA detects datapath errors with low overhead and latency. An evaluation on 38 benchmarks and a parallel HPC simulation shows that CONDA incurs only 57.79% runtime overhead, which is 41.84% faster than the existing state of the art, with the same level of error detection effectiveness and low detection latency.
Details
- Title: Subtitle
- Versatile Datapath Soft Error Detection on the Cheap for HPC Applications
- Creators
- Yafan Huang - Computer Science Department, University of Iowa, Iowa City, IA, USA
- Sheng Di - Argonne National Laboratory
- Zhaorui Zhang - Hong Kong Polytechnic University
- Xiaoyi Lu - University of California, Merced
- Guanpeng Li - Computer Science Department, University of Iowa, Iowa City, IA, USA
- Resource Type
- Conference proceeding
- Publication Details
- Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp.1-15
- Conference
- SC '24: The International Conference for High Performance Computing, Networking, Storage, and Analysis
- Series
- ACM Conferences
- DOI
- 10.1109/SC41406.2024.00061
- Publisher
- IEEE Press
- Grant note
- U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (ASCR): DE-AC02-06CH11357, DE-SC0024207, DE-SC0024559
This material was based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (ASCR), under contracts DE-AC02-06CH11357, DE-SC0024207, and DE-SC0024559.
- Language
- English
- Date published
- 11/17/2024
- Academic Unit
- Computer Science
- Record Identifier
- 9984748256002771