DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
- URL: http://arxiv.org/abs/2111.10146v1
- Date: Fri, 19 Nov 2021 10:46:45 GMT
- Title: DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
- Authors: Xu Yan, Zhengcong Fei, Shuhui Wang, Qingming Huang, Qi Tian
- Abstract summary: Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive change of information across the video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
- Score: 163.71539565491113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense video captioning (DVC) aims to generate multi-sentence descriptions to
elucidate the multiple events in the video, which is challenging and demands
visual consistency, discoursal coherence, and linguistic diversity. Existing
methods mainly generate captions from individual video segments, lacking
adaptation to the global visual context and progressive alignment between the
fast-evolving visual content and the textual descriptions, which results in
redundant and spliced descriptions. In this paper, we introduce the concept of
information flow to model the progressive change of information across the video
sequence and the captions. By designing a Cross-modal Information Flow Alignment
mechanism, the visual and textual information flows are captured and aligned,
which endows the captioning process with richer context and dynamics on
event/topic evolution. Based on the Cross-modal Information Flow Alignment
module, we further put forward DVCFlow framework, which consists of a
Global-local Visual Encoder to capture both global features and local features
for each video segment, and a pre-trained Caption Generator to produce
captions. Extensive experiments on the popular ActivityNet Captions and
YouCookII datasets demonstrate that our method significantly outperforms
competitive baselines and generates more human-like text according to subjective
and objective tests.
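The abstract only outlines the architecture at a high level; the following is a minimal illustrative sketch (assuming a PyTorch-style implementation, not the authors' released code) of the information-flow idea: segment features are encoded with global video context, the visual and textual "flows" are taken as the change between consecutive segment-level representations, and the two flows are aligned with a simple loss. The class names, dimensions, and the cosine-based alignment objective are assumptions for illustration only.

```python
# Minimal illustrative sketch (not the authors' code) of the information-flow
# idea: the "flow" is modelled as the change between consecutive segment-level
# representations, and the visual and textual flows are pulled together with a
# simple cosine-based alignment loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalVisualEncoder(nn.Module):
    """Encodes each segment locally and mixes in global video context (assumed design)."""

    def __init__(self, feat_dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.local_proj = nn.Linear(feat_dim, feat_dim)
        self.global_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)

    def forward(self, segment_feats: torch.Tensor) -> torch.Tensor:
        # segment_feats: (batch, num_segments, feat_dim)
        local = self.local_proj(segment_feats)
        # Global context: every segment attends over all segments of the video.
        global_ctx, _ = self.global_attn(local, local, local)
        return local + global_ctx


def information_flow(states: torch.Tensor) -> torch.Tensor:
    # "Flow" = change between consecutive segment/caption representations.
    # states: (batch, num_segments, dim) -> (batch, num_segments - 1, dim)
    return states[:, 1:, :] - states[:, :-1, :]


def flow_alignment_loss(visual_states: torch.Tensor, text_states: torch.Tensor) -> torch.Tensor:
    # Align the direction of the visual and textual flows (one illustrative
    # choice; the paper's actual alignment objective may differ).
    v_flow = F.normalize(information_flow(visual_states), dim=-1)
    t_flow = F.normalize(information_flow(text_states), dim=-1)
    return (1.0 - (v_flow * t_flow).sum(dim=-1)).mean()


if __name__ == "__main__":
    encoder = GlobalLocalVisualEncoder()
    video = torch.randn(2, 6, 512)     # 2 videos, 6 segments each
    captions = torch.randn(2, 6, 512)  # segment-level caption embeddings (e.g. from a caption generator)
    loss = flow_alignment_loss(encoder(video), captions)
    print(loss.item())
```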
Related papers
- Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with multi-modal contrastive loss.
Our approach is designed to capture the dependencies between these modalities, resulting in more accurate and pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z) - MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The plain and simple text descriptions and the visual-only focus of current language-video tasks result in limited capacity for real-world natural language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality- and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z) - PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning [46.69503728433432]
We present a semantic-assisted dense video captioning model based on the encoding-decoding framework.
Our method achieves significant improvements on the YouMakeup dataset under evaluation.
arXiv Detail & Related papers (2022-07-06T10:56:53Z) - Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Variational Stacked Local Attention Networks for Diverse Video Captioning [2.492343817244558]
The Variational Stacked Local Attention Network (VSLAN) exploits low-rank bilinear pooling for self-attentive feature interaction.
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
arXiv Detail & Related papers (2022-01-04T05:14:34Z) - Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
A cross-modal self-attention (CMSA) module utilizes fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal information across consecutive frames.
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
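For reference, below is a generic sketch of cross-modal self-attention in the spirit of the CMSA module summarized in the last entry: word features and flattened visual features are concatenated into one token sequence and jointly self-attended before being split back into the two streams. The module layout, dimensions, and residual/normalization details are assumptions for illustration, not the paper's implementation.

```python
# Generic cross-modal self-attention sketch (in the spirit of the CMSA module
# summarized above, not the paper's implementation): word features and
# flattened visual features are concatenated and jointly self-attended.
import torch
import torch.nn as nn


class CrossModalSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, word_feats: torch.Tensor, visual_feats: torch.Tensor):
        # word_feats:   (batch, num_words, dim)
        # visual_feats: (batch, H*W, dim) -- spatial grid flattened into tokens
        tokens = torch.cat([word_feats, visual_feats], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(tokens + fused)
        num_words = word_feats.size(1)
        # Split back into linguistic and visual streams after joint attention.
        return fused[:, :num_words], fused[:, num_words:]


if __name__ == "__main__":
    cmsa = CrossModalSelfAttention()
    words = torch.randn(2, 10, 256)
    pixels = torch.randn(2, 14 * 14, 256)
    w, v = cmsa(words, pixels)
    print(w.shape, v.shape)  # (2, 10, 256) and (2, 196, 256)
```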