Enhancing soft error resilience in HPC systems through performance variation analysis and cross-layer optimization
Dissertation   Open access

Zhengyang He
University of Iowa
Doctor of Philosophy (PhD), University of Iowa
Autumn 2025
DOI: 10.25820/etd.008223

Abstract

Soft errors, caused by radiation and transient faults, are becoming increasingly prevalent in modern HPC systems, leading to silent data corruptions (SDCs) that compromise result integrity. While many hardware-level techniques have been developed to mitigate these errors, software-level solutions remain critical due to their flexibility and portability across architectures. Among these, instruction duplication has emerged as a widely used and architecture-agnostic technique for soft error detection. By duplicating instructions and checking their consistency at runtime, it offers a general-purpose way to detect errors without requiring hardware modifications. However, existing research has largely focused on evaluating its SDC coverage, while two crucial aspects remain underexplored: (1) the true protection effectiveness after compilation, where transformations across abstraction layers may silently invalidate software-level protection guarantees, and (2) the runtime performance variation introduced by duplication under different workloads and programs. In this dissertation, I focus on addressing these two limitations.
First, in ISSRE’23, I systematically characterize how duplication-induced performance varies depending on instruction patterns and hardware features such as branch prediction and register pressure, identifying factors of high and low overhead. Second, in SC’23, I analyze the root causes of protection deficiencies by examining instruction duplication across the LLVM IR and binary levels, revealing semantic mismatches and cross-layer inconsistencies that degrade fault coverage. Finally, in DSN’24, I propose a fast, low-overhead protection technique that allows efficient and selective protection while preserving strong error detection at the assembly level.
Together, these efforts contribute a comprehensive understanding and a set of optimizations for instruction duplication, aiming to improve both practical error coverage and runtime efficiency in high-performance computing systems.
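
As a rough illustration of the duplicate-and-compare idea described above, the following Python sketch computes a result twice and checks that the two copies agree. It is only a conceptual model under stated assumptions, not the dissertation's implementation, which operates on LLVM IR and assembly. The names `protected_add`, `flip_bit`, `SoftErrorDetected`, and the `fault_bit` parameter are hypothetical and exist only to simulate a transient bit flip.

```python
from typing import Optional


class SoftErrorDetected(Exception):
    """Raised when the primary and shadow results disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Simulate a transient fault by flipping one bit of an integer."""
    return value ^ (1 << bit)


def protected_add(a: int, b: int, fault_bit: Optional[int] = None) -> int:
    """Add two integers with a duplicated computation and a consistency check.

    fault_bit is a hypothetical test hook: when set, one bit of the primary
    result is flipped to mimic a soft error striking that computation.
    """
    primary = a + b                      # original computation
    if fault_bit is not None:
        primary = flip_bit(primary, fault_bit)  # injected fault (simulation only)
    shadow = a + b                       # duplicated computation
    if primary != shadow:                # runtime consistency check
        raise SoftErrorDetected(f"mismatch: {primary} != {shadow}")
    return primary


if __name__ == "__main__":
    print(protected_add(2, 3))           # fault-free run
    try:
        protected_add(2, 3, fault_bit=0)  # run with a simulated bit flip
    except SoftErrorDetected as err:
        print("detected:", err)
```

The sketch also hints at both concerns raised in the abstract. Every protected operation is executed twice and followed by a comparison, which is where the runtime overhead comes from. In a compiled language, an optimizer could in principle merge two identical computations into one, which is one plausible way source-level duplication might lose its protection after compilation.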
Keywords: High Performance Computing; Error Resilience; Fault Tolerance; Instruction Duplication
