Conference proceeding
cuSZp2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio
Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp.1-18
ACM Conferences
SC '24: The International Conference for High Performance Computing, Networking, Storage, and Analysis
11/17/2024
DOI: 10.1109/SC41406.2024.00021
Abstract
Existing GPU lossy compressors suffer from expensive data movement overheads, inefficient memory access patterns, and high synchronization latency, resulting in limited throughput. This work proposes cuSZp2, a generic single-kernel error-bounded lossy compressor that runs purely on GPUs and is designed for applications that require high speed, such as large-scale GPU simulation and large language model training. In particular, cuSZp2 introduces a novel lossless encoding method, optimizes memory access patterns, and hides synchronization latency, achieving extreme end-to-end throughput and an optimized compression ratio. Experiments on an NVIDIA A100 GPU with 9 real-world HPC datasets demonstrate that, even with higher compression ratios and data quality, cuSZp2 delivers on average 332.42 GB/s and 513.04 GB/s end-to-end throughput for compression and decompression, respectively, which is around 2× that of existing pure-GPU compressors and 200× that of CPU-GPU hybrid compressors.
Details
- Title: Subtitle
- cuSZp2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio
- Creators
- Yafan Huang - Computer Science Department, University of Iowa, Iowa City, IA, USA
- Sheng Di - Argonne National Laboratory
- Guanpeng Li - Computer Science Department, University of Iowa, Iowa City, IA, USA
- Franck Cappello - Argonne National Laboratory
- Resource Type
- Conference proceeding
- Publication Details
- Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp.1-18
- Conference
- SC '24: The International Conference for High Performance Computing, Networking, Storage, and Analysis
- Series
- ACM Conferences
- DOI
- 10.1109/SC41406.2024.00021
- Publisher
- IEEE Press
- Grant note
- U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (ASCR): DE-AC02-06CH11357, DE-SC0024559
- National Science Foundation: OAC-2003709, OAC-2104023, OAC-2311875, OAC-2211538
This material was based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (ASCR), under contracts DE-AC02-06CH11357 and DE-SC0024559. The material was also supported by the National Science Foundation under Grants OAC-2003709, OAC-2104023, OAC-2311875, and OAC-2211538. The experimental resources for this paper were provided by the Laboratory Computing Resource Center on the Swing cluster at Argonne National Laboratory.
- Language
- English
- Date published
- 11/17/2024
- Academic Unit
- Computer Science
- Record Identifier
- 9984748158502771