Group Vision Transformer

Yaopeng Peng; Milan Sonka; Danny Z. Chen

doi:10.1145/3664647.3681709

Conference proceeding

Proceedings of the 32nd ACM International Conference on Multimedia, pp.2623-2631

ACM Conferences

MM '24: The 32nd ACM International Conference on Multimedia

10/28/2024

Abstract

The Vision Transformer has attained remarkable success in various computer vision applications. However, its large computational cost and complex design limit its ability to handle large feature maps. Existing research predominantly focuses on constraining attention to small local regions, which reduces the number of tokens participating in the attention computation but overlooks the computational demands of the feed-forward layer in the Vision Transformer block. In this paper, we introduce Group Vision Transformer (GVT), a relatively simple and efficient variant of Vision Transformer, aiming to improve attention computation. The core idea of our model is to divide and group the entire Transformer layer, instead of only the attention part, into multiple independent branches. This approach offers two advantages: (1) it helps reduce parameters and computational complexity; (2) it enhances the diversity of the learned features. We conduct a comprehensive analysis of the impact of different numbers of groups on model performance, as well as their influence on parameters and computational complexity. Our proposed GVT demonstrates competitive performance on several common vision tasks. For example, our GVT-Tiny model achieves 84.8% top-1 accuracy on ImageNet-1K, 51.4% box mAP and 45.2% mask mAP on MS COCO object detection and instance segmentation, and 50.1% mIoU on ADE20K semantic segmentation. It outperforms the CAFormer-S36 model by 0.3% in ImageNet-1K top-1 accuracy, by 1.2% in box mAP and 1.0% in mask mAP on MS COCO, and by 1.2% in mIoU on ADE20K, with similar model parameters and computational complexity. Code is accessible at https://github.com/yaoppeng/GVT.
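To illustrate why grouping the whole layer (attention and feed-forward alike) may reduce parameters, the sketch below counts the learnable parameters of one standard Transformer layer when its channels are split into independent branches. This is not taken from the paper or its repository: the function name `transformer_layer_params`, the `mlp_ratio` default of 4, and the layer layout (QKV projection, output projection, two-layer feed-forward network, two LayerNorms) are assumptions chosen for illustration, and how GVT actually recombines its branches is not described in this record.

```python
def transformer_layer_params(dim: int, mlp_ratio: int = 4, groups: int = 1) -> int:
    """Count parameters of one Transformer layer split into `groups` branches.

    Hypothetical layout (an assumption, not the paper's exact design):
    each branch is a full layer operating on dim // groups channels.
    """
    if dim % groups != 0:
        raise ValueError("dim must be divisible by groups")

    d = dim // groups          # channels handled by each branch
    hidden = mlp_ratio * d     # feed-forward hidden width per branch

    # Attention: Q, K, V projections plus the output projection (weights + biases).
    attn = 3 * (d * d + d) + (d * d + d)
    # Feed-forward: expand to `hidden`, then project back to `d`.
    ffn = (d * hidden + hidden) + (hidden * d + d)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d

    return groups * (attn + ffn + norms)


if __name__ == "__main__":
    for g in (1, 2, 4):
        print(g, transformer_layer_params(64, groups=g))
```

Under these assumptions the weight matrices shrink by a factor equal to the number of groups, because each of the G branches holds matrices of size (C/G) x (C/G) instead of one C x C matrix, while the bias and normalization vectors keep the same total size. The same reasoning applies to the feed-forward part, which is the cost that attention-only grouping leaves untouched.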

Computing methodologies

Details

Title: Group Vision Transformer
Creators: Yaopeng Peng (University of Notre Dame); Milan Sonka (University of Iowa); Danny Z. Chen (University of Notre Dame)
Resource Type: Conference proceeding
Publication Details: Proceedings of the 32nd ACM International Conference on Multimedia, pp.2623-2631
Conference: MM '24: The 32nd ACM International Conference on Multimedia
Series: ACM Conferences
DOI: 10.1145/3664647.3681709
Publisher: ACM
Language: English
Date published: 10/28/2024
Academic Unit: Roy J. Carver Department of Biomedical Engineering; Electrical and Computer Engineering; Radiation Oncology; Fraternal Order of Eagles Diabetes Research Center; Injury Prevention Research Center; Ophthalmology and Visual Sciences
Record Identifier: 9984738939402771
