Accelerating Sparse CNN Inference on GPUs with Performance-Aware Weight Pruning

Masuma Akter Rumi; Xiaolong Ma; Yanzhi Wang; Peng Jiang

doi:10.1145/3410463.3414648

Conference proceeding

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 267-278

International Conference on Parallel Architectures and Compilation Techniques

01/01/2020

Abstract

Weight pruning is a popular technique to reduce the size and computational complexity of convolutional neural networks (CNNs). Despite its success in reducing model size, weight pruning has brought limited benefit to CNN inference performance, due to the irregularity it introduces in sparse convolution operations. In this work, we aim to improve the performance of sparse convolutions on GPUs by mitigating this irregularity. We find that existing performance optimization techniques for sparse matrix computations fail to accelerate sparse convolutions, and we observe that the main performance bottleneck is caused by heavy control-flow instructions. Based on this observation, we propose a new GEMM-based implementation of sparse convolutions. Our main idea is to extract dense blocks of non-zeros in the sparse convolution kernels and to use dense matrix-matrix multiplication on these blocks to achieve high throughput. For cases where many non-zero weights cannot be grouped into dense blocks, we propose a performance-aware re-pruning strategy that removes the least important weights in the sparse kernels to further improve throughput. Experimental results with five real-world pruned CNN models show that our techniques can significantly improve the layer-wise performance of sparse convolution operations as well as the end-to-end performance of CNN inference.
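The dense-block idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the simplest possible grouping rule (rows of the weight matrix that share exactly the same non-zero column pattern form one dense block), uses plain Python lists in place of GPU memory, and uses a hand-written matrix product in place of a real GEMM call. The function names `extract_dense_blocks`, `dense_matmul`, and `blocked_sparse_matmul` are hypothetical. The re-pruning step is not covered.

```python
from collections import defaultdict


def extract_dense_blocks(weights):
    """Group rows with an identical non-zero column pattern into dense blocks.

    Assumption for illustration only: the paper's actual block-extraction
    rule is not given in the abstract.
    weights: list of rows (one per filter) over flattened kernel positions.
    Returns a list of (row_ids, col_ids, dense_block) tuples.
    """
    groups = defaultdict(list)
    for r, row in enumerate(weights):
        pattern = tuple(c for c, w in enumerate(row) if w != 0.0)
        if pattern:  # all-zero rows contribute nothing and are skipped
            groups[pattern].append(r)
    blocks = []
    for cols, rows in groups.items():
        block = [[weights[r][c] for c in cols] for r in rows]
        blocks.append((rows, list(cols), block))
    return blocks


def dense_matmul(a, b):
    """Plain dense matrix product, standing in for a GEMM call."""
    n = len(b[0])
    return [
        [sum(a_ik * b[k][j] for k, a_ik in enumerate(row)) for j in range(n)]
        for row in a
    ]


def blocked_sparse_matmul(weights, inputs):
    """Compute weights @ inputs with one dense product per extracted block."""
    out = [[0.0] * len(inputs[0]) for _ in weights]
    for rows, cols, block in extract_dense_blocks(weights):
        gathered = [inputs[c] for c in cols]  # only the input rows this block uses
        partial = dense_matmul(block, gathered)
        for r, vals in zip(rows, partial):
            out[r] = vals
    return out


# Rows 0 and 1 share the pattern (0, 2); row 2 has the pattern (1,).
W = [[1.0, 0.0, 2.0, 0.0],
     [3.0, 0.0, 4.0, 0.0],
     [0.0, 5.0, 0.0, 0.0]]
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(blocked_sparse_matmul(W, X))
```

Each block is multiplied without any per-element zero checks, which is the property the abstract attributes to the GEMM-based approach. Rows whose pattern is unique end up in one-row blocks here; those are the cases the abstract's re-pruning strategy is meant to address.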

Computer Science

Technology

Computer Science, Hardware & Architecture

Computer Science, Software Engineering

Computer Science, Theory & Methods

Science & Technology

Details

Title: Accelerating Sparse CNN Inference on GPUs with Performance-Aware Weight Pruning
Creators: Masuma Akter Rumi (University of Iowa); Xiaolong Ma (Northeastern University); Yanzhi Wang (Northeastern University); Peng Jiang (University of Iowa)
Resource Type: Conference proceeding
Publication Details: PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 267-278
Publisher: Association for Computing Machinery
Series: International Conference on Parallel Architectures and Compilation Techniques
DOI: 10.1145/3410463.3414648
ISSN: 1089-795X
eISSN: 2641-7944
Number of pages: 12
Language: English
Date published: 01/01/2020
Academic Unit: Computer Science
Record Identifier: 9984410841202771

Metrics

15 Times Cited - Web of Science