DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
- URL: http://arxiv.org/abs/2111.10146v1
- Date: Fri, 19 Nov 2021 10:46:45 GMT
- Title: DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
- Authors: Xu Yan, Zhengcong Fei, Shuhui Wang, Qingming Huang, Qi Tian
- Abstract summary: Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive information change across the video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
- Score: 163.71539565491113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense video captioning (DVC) aims to generate multi-sentence descriptions to
elucidate the multiple events in the video, which is challenging and demands
visual consistency, discoursal coherence, and linguistic diversity. Existing
methods mainly generate captions from individual video segments, lacking
adaptation to the global visual context and progressive alignment between the
fast-evolving visual content and textual descriptions, which results in
redundant and spliced descriptions. In this paper, we introduce the concept of
information flow to model the progressive information change across the video
sequence and the captions. By designing a Cross-modal Information Flow Alignment
mechanism, the visual and textual information flows are captured and aligned,
which endows the captioning process with richer context and dynamics on
event/topic evolution. Based on the Cross-modal Information Flow Alignment
module, we further put forward DVCFlow framework, which consists of a
Global-local Visual Encoder to capture both global features and local features
for each video segment, and a pre-trained Caption Generator to produce
captions. Extensive experiments on the popular ActivityNet Captions and
YouCookII datasets demonstrate that our method significantly outperforms
competitive baselines and generates more human-like text according to subjective
and objective tests.
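To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of a DVCFlow-style pipeline: a Global-local Visual Encoder, a stand-in caption generator, and an alignment term between the visual and textual information flows, modeled here as the differences between consecutive segment and caption states. The module names, dimensions, GRU-based generator, and cosine alignment loss are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# A minimal, illustrative sketch of a DVCFlow-style pipeline (assumed names and
# dimensions; not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalVisualEncoder(nn.Module):
    """Encodes each segment locally and contextualizes it across the whole video."""

    def __init__(self, feat_dim=1024, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.local_proj = nn.Linear(feat_dim, d_model)  # per-segment (local) features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.global_encoder = nn.TransformerEncoder(layer, n_layers)  # cross-segment (global) context

    def forward(self, segment_feats):            # (B, S, feat_dim)
        local = self.local_proj(segment_feats)   # (B, S, d_model)
        global_ctx = self.global_encoder(local)  # (B, S, d_model)
        return local, global_ctx


def information_flow(states):
    """Information flow: the change between consecutive segment/caption states."""
    return states[:, 1:] - states[:, :-1]        # (B, S-1, d_model)


def flow_alignment_loss(visual_states, text_states):
    """One possible cross-modal alignment: match the directions of the two flows."""
    v_flow = information_flow(visual_states)
    t_flow = information_flow(text_states)
    return (1.0 - F.cosine_similarity(v_flow, t_flow, dim=-1)).mean()


class DVCFlowSketch(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=1024, d_model=512):
        super().__init__()
        self.visual_encoder = GlobalLocalVisualEncoder(feat_dim, d_model)
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.caption_rnn = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a pre-trained generator
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, segment_feats, caption_tokens):
        # segment_feats: (B, S, feat_dim); caption_tokens: (B, S, L), one caption per segment
        local, global_ctx = self.visual_encoder(segment_feats)
        fused = local + global_ctx                                 # (B, S, d_model)

        B, S, L = caption_tokens.shape
        emb = self.word_embed(caption_tokens.reshape(B * S, L))    # (B*S, L, d_model)
        out, _ = self.caption_rnn(emb)                             # (B*S, L, d_model)
        logits = self.lm_head(out + fused.reshape(B * S, 1, -1))   # condition words on fused visual features
        text_states = out[:, -1].reshape(B, S, -1)                 # per-caption summary states

        align_loss = flow_alignment_loss(fused, text_states)
        return logits.reshape(B, S, L, -1), align_loss


# Toy usage: 2 videos, 4 segments each, 1024-d segment features, 6-token captions.
model = DVCFlowSketch()
feats = torch.randn(2, 4, 1024)
tokens = torch.randint(0, 10000, (2, 4, 6))
logits, align_loss = model(feats, tokens)
print(logits.shape, align_loss.item())
```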
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions and the visual-only focus of language-video tasks result in limited capacity for real-world natural language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN), for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
- Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z)
- PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning [46.69503728433432]
We present a semantic-assisted dense video captioning model based on the encoding-decoding framework.
Our method achieves significant improvements on the YouMakeup dataset under evaluation.
arXiv Detail & Related papers (2022-07-06T10:56:53Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Variational Stacked Local Attention Networks for Diverse Video Captioning [2.492343817244558]
Variational Stacked Local Attention Network exploits low-rank bilinear pooling for self-attentive feature interaction.
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
arXiv Detail & Related papers (2022-01-04T05:14:34Z)
- Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network [27.792054915363106]
A cross-modal self-attention (CMSA) module utilizes fine details of individual words and the input image or video.
A gated multi-level fusion (GMLF) module selectively integrates self-attentive cross-modal features.
A cross-frame self-attention (CFSA) module effectively integrates temporal information across consecutive frames (see the sketch after this entry).
arXiv Detail & Related papers (2021-02-09T11:27:59Z)
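To ground the module descriptions in the entry above, here is a minimal PyTorch sketch of the cross-modal self-attention idea: word features and visual (region or frame) features are projected into a shared space, concatenated into a single token sequence, and attended over jointly, so fine word-level details interact directly with visual positions. The projection layers, dimensions, and single-layer design are assumptions made for illustration, and the gated multi-level fusion (GMLF) and cross-frame (CFSA) components are omitted.

```python
# A minimal, illustrative sketch of cross-modal self-attention (assumed
# dimensions and projections; not the paper's implementation).
import torch
import torch.nn as nn


class CrossModalSelfAttention(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=300, d_model=512, n_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # project visual tokens (e.g., spatial regions)
        self.txt_proj = nn.Linear(txt_dim, d_model)   # project word embeddings
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vis_tokens, word_tokens):
        # vis_tokens: (B, N_regions, vis_dim); word_tokens: (B, N_words, txt_dim)
        joint = torch.cat([self.vis_proj(vis_tokens),
                           self.txt_proj(word_tokens)], dim=1)   # (B, N_r + N_w, d_model)
        attended, _ = self.attn(joint, joint, joint)             # every token attends to both modalities
        fused = self.norm(joint + attended)
        n_vis = vis_tokens.size(1)
        return fused[:, :n_vis], fused[:, n_vis:]                # split back into visual / textual parts


# Toy usage: 2 samples, 49 visual regions (7x7 grid), 12 words.
cmsa = CrossModalSelfAttention()
vis = torch.randn(2, 49, 2048)
words = torch.randn(2, 12, 300)
vis_out, txt_out = cmsa(vis, words)
print(vis_out.shape, txt_out.shape)   # (2, 49, 512) and (2, 12, 512)
```

Joint self-attention over the concatenated sequence is what lets every visual position attend to every word (and vice versa) in a single operation, which is the core of the cross-modal interaction described above.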
This list is automatically generated from the titles and abstracts of the papers on this site.