Enhancing soft error resilience in HPC systems through performance variation analysis and cross-layer optimization
Details
- Title: Subtitle
- Enhancing soft error resilience in HPC systems through performance variation analysis and cross-layer optimization
- Creators
- Zhengyang He
- Contributors
- Guanpeng Li (Advisor); Muchao Ye (Committee Member); Tianyu Zhang (Committee Member); Weiran Wang (Committee Member)
- Resource Type
- Dissertation
- Degree Awarded
- Doctor of Philosophy (PhD), University of Iowa
- Degree in
- Computer Science
- Date degree season
- Autumn 2025
- DOI
- 10.25820/etd.008223
- Publisher
- University of Iowa
- Number of pages
- xiii, 62 pages
- Copyright
- Copyright 2025 Zhengyang He
- Language
- English
- Date submitted
- 11/30/2025
- Description illustrations
- Illustrations, graphs, charts, tables
- Description bibliographic
- Includes bibliographical references (pages 58-62).
- Public Abstract (ETD)
High-performance computing (HPC) systems power some of the world's most critical technologies, from climate modeling to space exploration. However, as these systems grow more powerful, they also become more sensitive to subtle hardware faults. These small errors, known as soft errors, can silently alter a program's results without causing it to crash, making them especially dangerous.
This dissertation explores ways to make HPC software more resilient to such faults. It focuses on improving software-level error detection techniques by identifying their hidden weaknesses and enhancing their coverage across multiple levels of the computing stack. Through a series of practical experiments and new tools, this work demonstrates how to better protect scientific applications from invisible errors while keeping performance overhead low.
The goal of this research is to ensure that as we push the boundaries of science and engineering with advanced computing, we can also trust the results these systems produce. This work contributes toward building more reliable and trustworthy HPC software for future technologies and discoveries.
- Academic Unit
- Computer Science
- Record Identifier
- 9985135345702771