Diverse Video Captioning by Adaptive Spatio-temporal Attention
- URL: http://arxiv.org/abs/2208.09266v1
- Date: Fri, 19 Aug 2022 11:21:59 GMT
- Title: Diverse Video Captioning by Adaptive Spatio-temporal Attention
- Authors: Zohreh Ghaderi, Leonard Salewski, Hendrik P. A. Lensch
- Abstract summary: Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample.
- Score: 7.96569366755701
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To generate proper captions for videos, the inference needs to identify
relevant concepts and pay attention to the spatial relationships between them
as well as to the temporal development in the clip. Our end-to-end
encoder-decoder video captioning framework incorporates two transformer-based
architectures, an adapted transformer for a single joint spatio-temporal video
analysis as well as a self-attention-based decoder for advanced text
generation. Furthermore, we introduce an adaptive frame selection scheme to
reduce the number of required incoming frames while maintaining the relevant
content when training both transformers. Additionally, we estimate semantic
concepts relevant for video captioning by aggregating all ground truth captions
of each sample. Our approach achieves state-of-the-art results on the MSVD, as
well as on the large-scale MSR-VTT and the VATEX benchmark datasets considering
multiple Natural Language Generation (NLG) metrics. Additional evaluations on
diversity scores highlight the expressiveness and diversity in the structure of
our generated captions.
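As a rough illustration of the concept estimation step mentioned in the abstract, the sketch below aggregates all reference captions of one video into a small set of frequent words. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `aggregate_concepts`, the stop-word list, the per-caption counting, and the `top_k` cutoff are all hypothetical choices made for illustration.

```python
from collections import Counter
import re

# Hypothetical stop-word list (an assumption); a real pipeline would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to", "with"}


def aggregate_concepts(captions, top_k=5):
    """Return the top_k most frequent non-stop-words across all captions of one video."""
    counts = Counter()
    for caption in captions:
        tokens = re.findall(r"[a-z]+", caption.lower())
        # Count each word at most once per caption so long captions do not dominate.
        counts.update(set(tokens) - STOP_WORDS)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Made-up example captions for a single video sample.
captions = [
    "A man is playing a guitar",
    "A man plays the guitar on stage",
    "Someone is playing guitar",
]
concepts = aggregate_concepts(captions, top_k=3)
print(concepts)  # ['guitar', 'man', 'playing']
```

Words that recur across several annotators' captions ("guitar", "man") rise to the top, which is the intuition behind using aggregated captions as concept labels; how the paper actually selects and uses these concepts is not specified in the abstract.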
Related papers
- TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses.
To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z)
- Set Prediction Guided by Semantic Concepts for Diverse Video Captioning [47.89731738027379]
We formulate diverse captioning into a semantic-concept-guided set prediction problem.
We apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions.
The proposed model achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics.
arXiv Detail & Related papers (2023-12-25T13:13:04Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
- Variational Stacked Local Attention Networks for Diverse Video Captioning [2.492343817244558]
Variational Stacked Local Attention Network exploits low-rank bilinear pooling for self-attentive feature interaction.
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
arXiv Detail & Related papers (2022-01-04T05:14:34Z)
- DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model how information changes progressively across the video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z)
- Visual-aware Attention Dual-stream Decoder for Video Captioning [12.139806877591212]
The attention mechanism in the current video captioning method learns to assign weight to each frame, promoting the decoder dynamically.
This may not explicitly model the correlation and the temporal coherence of the visual features extracted in the sequence frames.
We propose a new Visual-aware Attention (VA) model, which unifies changes of temporal sequence frames with the words at the previous moment.
The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated.
arXiv Detail & Related papers (2021-10-16T14:08:20Z)
- Domain Adaptive Video Segmentation via Temporal Consistency Regularization [32.77436219094282]
This paper presents DA-VSN, a domain adaptive video segmentation network that addresses domain gaps in videos by temporal consistency regularization (TCR).
The first is cross-domain TCR that guides the prediction of target frames to have similar temporal consistency as that of source frames (learnt from annotated source data) via adversarial learning.
The second is intra-domain TCR that guides unconfident predictions of target frames to have similar temporal consistency as confident predictions of target frames.
arXiv Detail & Related papers (2021-07-23T02:50:42Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that includes an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.