Stacked Multimodal Attention Network for Context-Aware Video Captioning

Yi Zheng; Yuejie Zhang; Rui Feng; Tao Zhang; Weiguo Fan

doi:10.1109/TCSVT.2021.3058626

Journal article

Peer reviewed

IEEE transactions on circuits and systems for video technology, Vol.32(1), pp.31-42

01/2022

Abstract

Recent neural models for video captioning usually employ an attention-based encoder-decoder framework. However, current approaches mainly attend to the motion and object features of the video when generating the caption and ignore potentially useful historical information. In addition, current caption generation models commonly suffer from exposure bias and vanishing gradients. In this paper, we propose a novel video captioning framework, named Stacked Multimodal Attention Network (SMAN). It incorporates additional visual and textual historical information as context features during caption generation, employs a stacked architecture to process different features gradually, and uses reinforcement learning and a coarse-to-fine training strategy to further improve the generated results. Both quantitative and qualitative experiments on the MSVD and MSR-VTT benchmark datasets show the effectiveness and feasibility of our framework. The code is available at https://github.com/zhengyi123456/SMAN .
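
The idea of processing different features gradually through a stacked attention architecture can be illustrated with a small sketch. This is not the SMAN implementation from the repository above: the dot-product scoring, the additive state update, the function names, and the toy feature vectors are all assumptions chosen only to show how a decoder state might attend to several feature sets in sequence.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Soft dot-product attention (an assumed scoring function).

    Returns the weighted sum of the feature vectors and the weights.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return context, weights


def stacked_attend(query, modalities):
    """Attend to each feature set in turn.

    Each layer's context vector is added to the running state, which
    then serves as the query for the next layer. The additive update
    is an assumption made for illustration.
    """
    state = list(query)
    all_weights = []
    for features in modalities:
        context, weights = attend(state, features)
        state = [s + c for s, c in zip(state, context)]
        all_weights.append(weights)
    return state, all_weights


# Hypothetical 2-dimensional toy features for three feature sets.
motion = [[1.0, 0.0], [0.0, 1.0]]
objects = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]
history = [[0.2, 0.8]]

state, weights = stacked_attend([1.0, 0.0], [motion, objects, history])
```

Here `weights` holds one attention distribution per feature set, so the sketch also shows where historical context would enter: as one more feature set attended to after the visual ones.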

Biological system modeling

coarse-to-fine training

Context modeling

context-aware

Decoding

Feature extraction

Predictive models

reinforcement learning

stacked multimodal attention network

Training

Video captioning

Visualization

Details

Title: Stacked Multimodal Attention Network for Context-Aware Video Captioning
Creators: Yi Zheng - Fudan University
Yuejie Zhang - Fudan University
Rui Feng - Fudan University
Tao Zhang - Shanghai University of Finance and Economics
Weiguo Fan - University of Iowa
Resource Type: Journal article
Publication Details: IEEE transactions on circuits and systems for video technology, Vol.32(1), pp.31-42
Publisher: IEEE
DOI: 10.1109/TCSVT.2021.3058626
ISSN: 1051-8215
eISSN: 1558-2205
Grant note: 19ZR1417200 / Shanghai Natural Science Foundation (10.13039/100007219)
20511101203; 20511102702; 20511101403; 19DZ2205700; 2021SHZDZX0103 / Science and Technology Development Plan of Shanghai Science and Technology Commission
19YJA630116 / Humanities and Social Sciences Planning Fund of Ministry of Education of China (10.13039/501100013139)
61976057; 61572140 / National Natural Science Foundation of China (10.13039/501100001809)
Language: English
Date published: 01/2022
Academic Unit: Business Analytics
Record Identifier: 9984380476002771

Metrics

17 Times Cited - Web of Science