Variational Stacked Local Attention Networks for Diverse Video
Captioning
- URL: http://arxiv.org/abs/2201.00985v1
- Date: Tue, 4 Jan 2022 05:14:34 GMT
- Title: Variational Stacked Local Attention Networks for Diverse Video
Captioning
- Authors: Tonmoay Deb, Akib Sadmanee, Kishor Kumar Bhaumik, Amin Ahsan Ali, M
Ashraful Amin, A K M Mahbubur Rahman
- Abstract summary: Variational Stacked Local Attention Network (VSLAN) exploits low-rank bilinear pooling for self-attentive feature interaction.
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
- Score: 2.492343817244558
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While describing spatio-temporal events in natural language, video
captioning models mostly rely on the encoder's latent visual representation.
Recent progress on encoder-decoder models attends to encoder features mainly
through linear interaction with the decoder. However, the growing complexity of
models for visual data encourages more explicit feature interaction for
fine-grained information, which is currently absent in the video captioning
domain. Moreover, feature aggregation methods have been used to unveil richer
visual representations, either by concatenation or through a linear layer.
Though the feature sets for a video semantically overlap to some extent, these
approaches result in objective mismatch and feature redundancy. In addition,
diversity in captions is a fundamental component of expressing one event from
several meaningful perspectives, and it is currently missing in the temporal,
i.e., video captioning, domain.
To this end, we propose Variational Stacked Local Attention Network (VSLAN),
which exploits low-rank bilinear pooling for self-attentive feature interaction
and stacking multiple video feature streams in a discount fashion. Each feature
stack's learned attributes contribute to our proposed diversity encoding
module, followed by the decoding query stage to facilitate end-to-end diverse
and natural captions without any explicit supervision on attributes. We
evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
The CIDEr score of VSLAN outperforms current off-the-shelf methods by $7.8\%$
on MSVD and $4.5\%$ on MSR-VTT, respectively. On the same datasets, VSLAN
achieves competitive results in caption diversity metrics.
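The key building block named in the abstract, self-attentive feature interaction via low-rank bilinear pooling, can be pictured with a short PyTorch-style sketch. This is a minimal illustration under stated assumptions: the module name, dimensions, tanh projections, and the pairwise scoring scheme are chosen for exposition and are not VSLAN's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankBilinearAttention(nn.Module):
    """Hypothetical low-rank bilinear self-attention over per-frame features."""
    def __init__(self, feat_dim=1024, rank=256):
        super().__init__()
        # U and V project features into a shared low-rank space; their
        # Hadamard product approximates a full bilinear interaction cheaply.
        self.U = nn.Linear(feat_dim, rank)
        self.V = nn.Linear(feat_dim, rank)
        self.score = nn.Linear(rank, 1)        # scores each frame-pair interaction
        self.out = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        q = torch.tanh(self.U(feats))          # (B, T, rank)
        k = torch.tanh(self.V(feats))          # (B, T, rank)
        inter = q.unsqueeze(2) * k.unsqueeze(1)    # (B, T, T, rank) pairwise pooling
        logits = self.score(inter).squeeze(-1)     # (B, T, T)
        attn = F.softmax(logits, dim=-1)           # attention over key frames
        attended = torch.bmm(attn, feats)          # (B, T, feat_dim)
        return self.out(attended)

# Usage: self-attend a batch of 26 frame-level feature vectors per video.
frames = torch.randn(2, 26, 1024)
context = LowRankBilinearAttention()(frames)       # -> (2, 26, 1024)
```

The Hadamard product of the two rank-256 projections stands in for a full feat_dim x feat_dim bilinear map, which is the usual motivation for low-rank bilinear pooling.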
Related papers
- EVC-MF: End-to-end Video Captioning Network with Multi-scale Features [13.85795110061781]
We propose an end-to-end encoder-decoder-based network (EVC-MF) for video captioning.
It efficiently utilizes multi-scale visual and textual features to generate video descriptions.
The results demonstrate that EVC-MF yields competitive performance compared with state-of-the-art methods.
arXiv Detail & Related papers (2024-10-22T02:16:02Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating the effectiveness of our method.
arXiv Detail & Related papers (2022-11-17T11:27:15Z) - Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive information changing across video sequence and captions.
Our method significantly outperforms competitive baselines, and generates more human-like text according to subject and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - Referring Segmentation in Images and Videos with Cross-Modal
Self-Attention Network [27.792054915363106]
A cross-modal self-attention (CMSA) module utilizes fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal information in consecutive frames.
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.