Visual-aware Attention Dual-stream Decoder for Video Captioning
- URL: http://arxiv.org/abs/2110.08578v1
- Date: Sat, 16 Oct 2021 14:08:20 GMT
- Title: Visual-aware Attention Dual-stream Decoder for Video Captioning
- Authors: Zhixin Sun, Xian Zhong, Shuqin Chen, Lin Li, and Luo Zhong
- Abstract summary: The attention mechanism in current video captioning methods learns to assign a weight to each frame, guiding the decoder dynamically.
This may not explicitly model the correlation and temporal coherence of the visual features extracted from the sequence of frames.
We propose a new Visual-aware Attention (VA) model, which concatenates the dynamic changes of the temporal frame sequence with the word at the previous moment.
The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated.
- Score: 12.139806877591212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning is a challenging task that captures different visual parts
and describes them in sentences, as it requires both visual and linguistic
coherence. The attention mechanism in current video captioning methods learns
to assign a weight to each frame, guiding the decoder dynamically. This may not
explicitly model the correlation and temporal coherence of the visual features
extracted from the sequence of frames. To generate semantically coherent
sentences, we propose a new Visual-aware Attention (VA) model, which
concatenates the dynamic changes of the temporal frame sequence with the word
at the previous moment as the input of the attention mechanism to extract
sequence features. In addition, prevalent approaches widely use teacher-forcing
(TF) learning during training, where the next token is generated conditioned on
the previous ground-truth tokens, so the semantic information in the previously
generated tokens is lost. Therefore, we design a self-forcing (SF) stream that
takes the semantic information in the probability distribution of the previous
token as input to enhance the current token. The Dual-stream Decoder (DD)
architecture unifies the TF and SF streams, encouraging both streams to
generate sentences close to the annotated captions. Meanwhile, the Dual-stream
Decoder alleviates the exposure bias problem caused by the discrepancy between
training and testing in TF learning. The effectiveness of the proposed
Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated through
experimental studies on the Microsoft Video Description (MSVD) corpus and
MSR-Video to Text (MSR-VTT) datasets.
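To make the abstract's description more concrete, below is a minimal PyTorch-style sketch of the Visual-aware Attention and Dual-stream Decoder ideas as they read from the abstract. All class names, dimensions, and wiring choices (e.g. how the frame-to-frame change is computed, how the SF stream re-embeds the previous token distribution) are assumptions for illustration, not the authors' reference implementation.

```python
# Sketch only: an assumed formulation of VA attention and the TF/SF dual-stream
# decoder inferred from the abstract, not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAwareAttention(nn.Module):
    """Scores each frame from the frame features, their temporal change, and the
    previous word embedding, as the abstract suggests (assumed formulation)."""
    def __init__(self, feat_dim, word_dim, hid_dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim * 2 + word_dim, hid_dim),
            nn.Tanh(),
            nn.Linear(hid_dim, 1),
        )

    def forward(self, frames, prev_word):
        # frames: (B, T, feat_dim); prev_word: (B, word_dim)
        shifted = torch.cat([frames[:, :1], frames[:, :-1]], dim=1)
        delta = frames - shifted                          # change between consecutive frames
        w = prev_word.unsqueeze(1).expand(-1, frames.size(1), -1)
        logits = self.score(torch.cat([frames, delta, w], dim=-1)).squeeze(-1)
        alpha = F.softmax(logits, dim=-1)                 # attention weights (B, T)
        return torch.bmm(alpha.unsqueeze(1), frames).squeeze(1)  # context (B, feat_dim)


class DualStreamDecoder(nn.Module):
    """A teacher-forcing (TF) stream fed with ground-truth tokens and a
    self-forcing (SF) stream fed with the previous predicted distribution;
    both share the attention, RNN cell, and output layer (an assumption)."""
    def __init__(self, vocab, feat_dim, word_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.att = VisualAwareAttention(feat_dim, word_dim, hid_dim)
        self.rnn = nn.GRUCell(feat_dim + word_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def step(self, frames, word_vec, h):
        ctx = self.att(frames, word_vec)
        h = self.rnn(torch.cat([ctx, word_vec], dim=-1), h)
        return self.out(h), h

    def forward(self, frames, captions):
        B, L = captions.shape
        h_tf = frames.new_zeros(B, self.rnn.hidden_size)
        h_sf = h_tf.clone()
        prev_dist = None
        tf_logits, sf_logits = [], []
        for t in range(L - 1):
            gt_vec = self.embed(captions[:, t])
            # TF stream: condition on the ground-truth previous token.
            logit_tf, h_tf = self.step(frames, gt_vec, h_tf)
            # SF stream: condition on a soft embedding of the previous predicted
            # distribution, keeping its semantic information.
            sf_vec = gt_vec if prev_dist is None else prev_dist @ self.embed.weight
            logit_sf, h_sf = self.step(frames, sf_vec, h_sf)
            prev_dist = F.softmax(logit_sf, dim=-1)
            tf_logits.append(logit_tf)
            sf_logits.append(logit_sf)
        return torch.stack(tf_logits, 1), torch.stack(sf_logits, 1)
```

In a sketch like this, both output streams would be supervised with cross-entropy against the annotated captions, and inference would follow the SF-style path that conditions on predictions rather than ground truth, which is one plausible way the dual-stream design narrows the train/test discrepancy behind exposure bias.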
Related papers
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Conditional Variational Autoencoder for Sign Language Translation with
Cross-Modal Alignment [33.96363443363547]
Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences.
We propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT).
CV-SLT consists of two paths with two Kullback-Leibler divergences to regularize the outputs of the encoder and decoder.
arXiv Detail & Related papers (2023-12-25T08:20:40Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that, with the proposed framework, a pre-trained diffusion model can be adapted to downstream visual perception tasks faster.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separately pre-trained feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z) - MILES: Visual BERT Pre-training with Injected Language Semantics for
Video-text Retrieval [43.2299969152561]
Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols.
arXiv Detail & Related papers (2022-04-26T16:06:31Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - Variational Stacked Local Attention Networks for Diverse Video
Captioning [2.492343817244558]
The Variational Stacked Local Attention Network (VSLAN) exploits low-rank bilinear pooling for self-attentive feature interaction.
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
arXiv Detail & Related papers (2022-01-04T05:14:34Z) - DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive change of information across the video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.