Conference proceeding
Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications
2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.582-594
05/27/2024
DOI: 10.1109/IPDPS57955.2024.00058
Abstract
Due to the increasing scale of high-performance computing (HPC) systems, transient hardware faults have become a major reliability concern. Consequently, Silent Data Corruptions (SDCs) due to these faults have been a common insidious consequence in GPU applications. Developers often measure the application resilience with a set of program test inputs available in the benchmark suite, assuming the resilience would not fluctuate much among different inputs. However, we observe that this assumption often results in an over-optimistic evaluation for GPU applications. As a result, the subsequent SDC protection following the evaluation can hardly meet the expected reliability bar in the production environment, where applications would run with potentially arbitrary input values. To this end, we propose Druto - a compiler-based automated technique that searches for inputs to incrementally approach the upper bound of a GPU application's SDC probability. We develop Druto based on the property that the resilience profiles of a small group of representative threads in a GPU kernel can approximately rank various inputs in terms of the overall SDC probability. Therefore, Druto strategically steers the search towards new program inputs that efficiently portray the overall SDC probability. Evaluation shows that the SDC probability derived from Druto's input generation is as much as 74× higher than that from existing techniques. Moreover, existing techniques cannot find our generated inputs even given 5× more search time.
Details
- Title: Subtitle
- Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications
- Creators
- Md Hasanur Rahman - University of Iowa, IA, USA; Sheng Di - Argonne National Laboratory; Shengjian Guo - Amazon; Xiaoyi Lu - University of California, Merced; Guanpeng Li - University of Iowa, IA, USA; Franck Cappello - Argonne National Laboratory
- Resource Type
- Conference proceeding
- Publication Details
- 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.582-594
- Publisher
- IEEE
- DOI
- 10.1109/IPDPS57955.2024.00058
- ISSN
- 1530-2075
- eISSN
- 1530-2075
- Grant note
- Office of Science (10.13039/100006132); Advanced Scientific Computing Research (10.13039/100006192); U.S. Department of Energy (10.13039/100000015)
- Language
- English
- Date published
- 05/27/2024
- Academic Unit
- Computer Science
- Record Identifier
- 9984658351002771