Versatile Datapath Soft Error Detection on the Cheap for HPC Applications
Conference proceeding

Yafan Huang, Sheng Di, Zhaorui Zhang, Xiaoyi Lu and Guanpeng Li
Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp.1-15
ACM Conferences
SC '24: The International Conference for High Performance Computing, Networking, Storage, and Analysis
11/17/2024
DOI: 10.1109/SC41406.2024.00061

Abstract

With the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors that corrupt datapath units during program execution. While these error types have received considerable attention recently, existing solutions either confine themselves to limited scopes or incur massive overheads in performance and power consumption, hindering practical usage. In this work, we propose CONDA, a novel error detection technique based on code transformation and static program analysis that achieves versatile datapath protection at low cost. At compile time, CONDA analyzes program characteristics and transforms the original program code without complicating its control-flow and memory access patterns. At runtime, CONDA detects datapath errors with low overhead and latency. An evaluation on 38 benchmarks and a parallel HPC simulation shows that CONDA incurs only 57.79% runtime overhead, making it 41.84% faster than the existing state of the art, with the same level of error detection effectiveness and low detection latency.

Hardware
Hardware -- Hardware test
Hardware -- Hardware validation
Hardware -- Hardware validation -- Post-manufacture validation and debug
Hardware -- Hardware validation -- Post-manufacture validation and debug -- Bug detection, localization and diagnosis
Hardware -- Robustness
Hardware -- Robustness -- Hardware reliability
Hardware -- Robustness -- Hardware reliability -- Transient errors and upsets
Software and its engineering
Software and its engineering -- Software creation and management
Software and its engineering -- Software creation and management -- Software development techniques
Software and its engineering -- Software creation and management -- Software development techniques -- Error handling and recovery
Software and its engineering -- Software creation and management -- Software verification and validation
Software and its engineering -- Software creation and management -- Software verification and validation -- Software defect analysis
Software and its engineering -- Software notations and tools
Software and its engineering -- Software notations and tools -- Compilers
Software and its engineering -- Software organization and properties
Software and its engineering -- Software organization and properties -- Software functional properties
Software and its engineering -- Software organization and properties -- Software functional properties -- Formal methods
Software and its engineering -- Software organization and properties -- Software functional properties -- Formal methods -- Automated static analysis
