Conference proceeding
Group Vision Transformer
Proceedings of the 32nd ACM International Conference on Multimedia, pp.2623-2631
ACM Conferences
MM '24: The 32nd ACM International Conference on Multimedia
10/28/2024
DOI: 10.1145/3664647.3681709
Abstract
The Vision Transformer has attained remarkable success in various computer vision applications. However, its large computational cost and complex design limit its ability to handle large feature maps. Existing research predominantly focuses on constraining attention to small local regions, which reduces the number of tokens participating in the attention computation but overlooks the computational demands of the feed-forward layer in the Vision Transformer block. In this paper, we introduce Group Vision Transformer (GVT), a relatively simple and efficient variant of the Vision Transformer that aims to improve attention computation. The core idea of our model is to divide and group the entire Transformer layer, rather than only the attention part, into multiple independent branches. This approach offers two advantages: (1) it helps reduce parameters and computational complexity; (2) it enhances the diversity of the learned features. We conduct a comprehensive analysis of how the number of groups affects model performance, parameter count, and computational complexity. Our proposed GVT demonstrates competitive performance on several common vision tasks. For example, our GVT-Tiny model achieves 84.8% top-1 accuracy on ImageNet-1K, 51.4% box mAP and 45.2% mask mAP on MS COCO object detection and instance segmentation, and 50.1% mIoU on ADE20K semantic segmentation, outperforming the CAFormer-S36 model by 0.3% in ImageNet-1K top-1 accuracy, 1.2% in box mAP and 1.0% in mask mAP on MS COCO object detection and instance segmentation, and 1.2% in mIoU on ADE20K semantic segmentation, with similar model parameters and computational complexity. Code is accessible at https://github.com/yaoppeng/GVT.
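As a rough illustration of why grouping a whole Transformer layer, and not only its attention part, can reduce parameters, the sketch below counts the weights of one layer before and after splitting it into independent branches. It rests on assumptions the abstract does not state: channels are split evenly among the groups, each branch is a standard attention-plus-feed-forward layer with expansion ratio 4, and normalization layers are ignored. The function names are hypothetical and are not taken from the GVT code.

```python
def layer_params(dim: int, mlp_ratio: int = 4) -> int:
    """Weights and biases of one standard Transformer layer.

    Counts the Q, K, V and output projections plus the two
    feed-forward linear layers; LayerNorm parameters are omitted.
    """
    attn = 4 * (dim * dim + dim)  # Q, K, V, and output projection
    hidden = mlp_ratio * dim
    ffn = (dim * hidden + hidden) + (hidden * dim + dim)
    return attn + ffn


def grouped_layer_params(dim: int, groups: int, mlp_ratio: int = 4) -> int:
    """Parameters when the whole layer is split into `groups` branches.

    Assumption (not from the source): each branch is an independent
    layer operating on dim // groups channels.
    """
    if dim % groups != 0:
        raise ValueError("dim must be divisible by groups")
    return groups * layer_params(dim // groups, mlp_ratio)


if __name__ == "__main__":
    dim = 384  # illustrative width, not a GVT setting
    base = layer_params(dim)
    for g in (1, 2, 4):
        p = grouped_layer_params(dim, g)
        print(f"groups={g}: {p} params ({p / base:.2f}x of ungrouped)")
```

Under these assumptions the quadratic weight terms shrink by roughly a factor equal to the number of groups, for the attention projections and the feed-forward layers alike. The actual GVT design may differ in how branches are formed and recombined.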
Details
- Title: Subtitle
- Group Vision Transformer
- Creators
- Yaopeng Peng - University of Notre Dame; Milan Sonka - University of Iowa; Danny Z. Chen - University of Notre Dame
- Resource Type
- Conference proceeding
- Publication Details
- Proceedings of the 32nd ACM International Conference on Multimedia, pp.2623-2631
- Conference
- MM '24: The 32nd ACM International Conference on Multimedia
- Series
- ACM Conferences
- DOI
- 10.1145/3664647.3681709
- Publisher
- ACM
- Language
- English
- Date published
- 10/28/2024
- Academic Unit
- Roy J. Carver Department of Biomedical Engineering; Electrical and Computer Engineering; Radiation Oncology; Fraternal Order of Eagles Diabetes Research Center; Injury Prevention Research Center; Ophthalmology and Visual Sciences
- Record Identifier
- 9984738939402771