Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment
- URL: http://arxiv.org/abs/2307.02682v2
- Date: Tue, 11 Jul 2023 04:10:49 GMT
- Title: Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment
- Authors: Yongrae Jo, Seongyun Lee, Aiden SJ Lee, Hyunji Lee, Hanseok Oh,
Minjoon Seo
- Abstract summary: We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
- Score: 10.567291051485194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense video captioning, a task of localizing meaningful moments and
generating relevant captions for videos, often requires a large, expensive
corpus of annotated video segments paired with text. In an effort to minimize
the annotation cost, we propose ZeroTA, a novel method for dense video
captioning in a zero-shot manner. Our method does not require any videos or
annotations for training; instead, it localizes and describes events within
each input video at test time by optimizing solely on the input. This is
accomplished by introducing a soft moment mask that represents a temporal
segment in the video and jointly optimizing it with the prefix parameters of a
language model. This joint optimization aligns a frozen language generation
model (i.e., GPT-2) with a frozen vision-language contrastive model (i.e.,
CLIP) by maximizing the matching score between the generated text and a moment
within the video. We also introduce a pairwise temporal IoU loss to let a set
of soft moment masks capture multiple distinct events within the video. Our
method effectively discovers diverse significant events within the video, with
the resulting captions appropriately describing these events. The empirical
results demonstrate that ZeroTA surpasses zero-shot baselines and even
outperforms the state-of-the-art few-shot method on the widely-used benchmark
ActivityNet Captions. Moreover, our method shows greater robustness compared to
supervised methods when evaluated in out-of-domain scenarios. This research
provides insight into the potential of aligning widely-used models, such as
language generation models and vision-language models, to unlock a new
capability: understanding temporal aspects of videos.
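To make the moment-level components concrete, below is a minimal PyTorch-style sketch of a sigmoid-bounded soft temporal mask and a pairwise temporal IoU penalty of the kind the abstract describes. All names (soft_mask, moment_feature, pairwise_tiou_loss), the mask parameterization, and the loss weight are illustrative assumptions rather than the authors' implementation; the frozen GPT-2 prefix-tuning side of the joint optimization is omitted, and random tensors stand in for frozen CLIP features.

```python
# Illustrative sketch (not the authors' code) of two ideas from the abstract:
# a differentiable soft moment mask over video frames and a pairwise temporal
# IoU penalty that pushes multiple masks toward distinct events.
import torch

T, K = 100, 3                                          # frames per video, number of moments
centers = torch.rand(K, requires_grad=True)            # learnable moment centers in [0, 1]
widths  = torch.full((K,), 0.2, requires_grad=True)    # learnable moment widths

def soft_mask(center, width, steepness=50.0):
    """Sigmoid-bounded temporal mask: ~1 inside the moment, ~0 outside."""
    t = torch.linspace(0.0, 1.0, T)
    left  = torch.sigmoid(steepness * (t - (center - width / 2)))
    right = torch.sigmoid(steepness * ((center + width / 2) - t))
    return left * right

def moment_feature(frame_feats, mask):
    """Mask-weighted average of frame features -> one moment-level vector."""
    return (mask[:, None] * frame_feats).sum(0) / mask.sum()

def pairwise_tiou_loss(centers, widths, eps=1e-6):
    """Average temporal IoU over all moment pairs; minimizing it reduces overlap."""
    starts, ends = centers - widths / 2, centers + widths / 2
    total, pairs = 0.0, 0
    for i in range(K):
        for j in range(i + 1, K):
            inter = (torch.minimum(ends[i], ends[j]) -
                     torch.maximum(starts[i], starts[j])).clamp(min=0.0)
            union = (torch.maximum(ends[i], ends[j]) -
                     torch.minimum(starts[i], starts[j])).clamp(min=eps)
            total, pairs = total + inter / union, pairs + 1
    return total / pairs

# Toy usage: random features stand in for frozen CLIP image/text encoder outputs.
frame_feats = torch.randn(T, 512)                      # would come from CLIP's image encoder
text_feat   = torch.randn(512)                         # would come from CLIP's text encoder
masks = [soft_mask(centers[k], widths[k]) for k in range(K)]
match = torch.stack([torch.cosine_similarity(moment_feature(frame_feats, m),
                                             text_feat, dim=0) for m in masks]).mean()
loss = -match + 0.3 * pairwise_tiou_loss(centers, widths)   # 0.3 is an arbitrary weight
loss.backward()                                        # gradients reach centers and widths
```

Because the mask boundaries are sigmoids rather than hard cut-offs, the matching score stays differentiable with respect to each moment's center and width, which is what lets the moments be localized purely by gradient descent on the input at test time.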
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the query sentence typically matches only a fraction of the prominent foreground video content and exhibits limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replacing entities, actions, and flipping event order.
Our model sets a new state-of-the-art zero-shot performance on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z) - Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z) - Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z) - Zero-Shot Video Captioning with Evolving Pseudo-Tokens [79.16706829968673]
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model.
The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames.
Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge.
arXiv Detail & Related papers (2022-07-22T14:19:31Z) - Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z) - Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z)