Journal article
GEREM: Fast and Precise Error Resilience Assessment for GPU Microarchitectures
IEEE transactions on parallel and distributed systems, Vol.36(5), pp.1011-1024
05/2025
DOI: 10.1109/TPDS.2025.3552679
Abstract
GPUs are widely used hardware acceleration platforms in many areas due to their high computational throughput. Meanwhile, GPUs are vulnerable to transient hardware faults in the post-Moore era. Analyzing the error resilience of GPUs is critical for both hardware and software. Statistical fault injection approaches, which are commonly used for error resilience analysis, are highly accurate but very time-consuming. In this work, we propose GEREM, a first framework to speed up the fault injection process so as to estimate the error resilience of GPU microarchitectures swiftly and precisely. We find that early fault behaviors can be used to accurately predict the final outcomes of program execution. Based on this observation, we categorize the early behaviors of hardware faults into GPU Early Fault Manifestation models (EFMs). For data structures, EFMs are early propagation characteristics of faults, while for pipeline instructions, EFMs are heuristic properties of several instruction contexts. We further observe that EFMs are determined by static microarchitecture states, so we can capture them without actually simulating the program execution process under fault injections. Leveraging these observations, our GEREM framework first profiles the microarchitectural states related to EFMs once. It then injects faults into the profiled traces to immediately generate EFMs. For data storage structures, EFMs are directly used to predict final fault outcomes, while for pipeline instructions, machine learning is used for prediction. Evaluation results show that GEREM precisely assesses the error resilience of GPU microarchitecture structures with a 237× speedup on average compared with traditional fault injections.
Details
- Title: Subtitle
- GEREM: Fast and Precise Error Resilience Assessment for GPU Microarchitectures
- Creators
- Jingweijia Tan - Jilin Province Science and Technology Department
- Xurui Li - Jilin University
- An Zhong - Jilin Province Science and Technology Department
- Kaige Yan - Jilin University
- Xiaohui Wei - Jilin Province Science and Technology Department
- Guanpeng Li - University of Iowa
- Resource Type
- Journal article
- Publication Details
- IEEE transactions on parallel and distributed systems, Vol.36(5), pp.1011-1024
- DOI
- 10.1109/TPDS.2025.3552679
- ISSN
- 1045-9219
- eISSN
- 1558-2183
- Publisher
- IEEE
- Number of pages
- 14
- Grant note
- National Natural Science Foundation of China (NSFC): 62372207
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62372207.
- Language
- English
- Electronic publication date
- 03/17/2025
- Date published
- 05/2025
- Academic Unit
- Computer Science
- Record Identifier
- 9984802410202771