Language-free Training for Zero-shot Video Grounding
- URL: http://arxiv.org/abs/2210.12977v1
- Date: Mon, 24 Oct 2022 06:55:29 GMT
- Title: Language-free Training for Zero-shot Video Grounding
- Authors: Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn
- Abstract summary: Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
- Score: 50.701372436100684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an untrimmed video and a language query depicting a specific temporal
moment in the video, video grounding aims to localize the time interval by
understanding the text and video simultaneously. One of the most challenging
issues is the extremely time- and cost-consuming collection of annotations,
including video captions in natural language form and their corresponding
temporal regions. In this paper, we present a simple yet novel training
framework for video grounding in the zero-shot setting, which learns a network
from video data alone, without any annotations. Inspired by the recent
language-free paradigm, i.e. training without language data, we train the
network without forcing pseudo (fake) text queries to be generated in natural
language form. Specifically, we learn a video grounding model by selecting a
temporal interval as a hypothetical correct answer and treating the visual
feature selected within that interval as a language feature, exploiting the
well-aligned visual-language space of CLIP. Extensive experiments demonstrate
the effectiveness of our language-free training framework, which outperforms
the existing zero-shot video grounding method and even several
weakly-supervised approaches by large margins on two standard datasets.
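
For intuition, the following is a minimal sketch of how the language-free idea can be instantiated: a random temporal interval is sampled as a hypothetical ground truth, the CLIP visual features inside it are pooled into a pseudo language query, and a grounding head regresses the interval boundaries. The dummy frame features, the uniform sampling, the mean-pooling, and the GroundingHead module are illustrative assumptions, not the paper's exact components.

```python
# Minimal sketch of language-free pseudo-query generation for zero-shot
# video grounding. Assumes frame-level CLIP visual features are precomputed;
# the random interval sampling, mean-pooling, and GroundingHead below are
# simplified stand-ins for the paper's interval/feature selection modules.
import torch
import torch.nn as nn

def sample_pseudo_interval(num_frames: int) -> tuple[int, int]:
    """Pick a random temporal interval [t_s, t_e) as a hypothetical answer."""
    t_s = torch.randint(0, num_frames - 1, (1,)).item()
    t_e = torch.randint(t_s + 1, num_frames, (1,)).item()
    return t_s, t_e

def pseudo_language_query(frame_feats: torch.Tensor, t_s: int, t_e: int) -> torch.Tensor:
    """Pool CLIP visual features inside the interval and use the result as a
    language-side query, relying on CLIP's shared visual-language space."""
    return frame_feats[t_s:t_e].mean(dim=0)

class GroundingHead(nn.Module):
    """Hypothetical head: predicts normalized (start, end) from video + query."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, frame_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        video_repr = frame_feats.mean(dim=0)
        return torch.sigmoid(self.mlp(torch.cat([video_repr, query], dim=-1)))

# One illustrative training step on dummy features.
num_frames, dim = 64, 512
frame_feats = torch.randn(num_frames, dim)            # stand-in for CLIP frame features
t_s, t_e = sample_pseudo_interval(num_frames)
query = pseudo_language_query(frame_feats, t_s, t_e)
target = torch.tensor([t_s / num_frames, t_e / num_frames])

head = GroundingHead(dim)
pred = head(frame_feats, query)
loss = nn.functional.l1_loss(pred, target)             # regress pseudo boundaries
loss.backward()
```

In the actual method, the interval and the visual features within it are chosen by dedicated selection mechanisms rather than sampled and pooled at random; the sketch only conveys how CLIP's shared embedding space lets visual features stand in for language queries at training time.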
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Contrastive Language Video Time Pre-training [12.876308881183371]
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning.
Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features.
We validated our method on CharadesEgo action recognition, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-06-04T02:48:59Z)
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through an MIL-based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
- Zero-shot Natural Language Video Localization [11.522385805128001]
We make a first attempt to train a natural language video localization model in a zero-shot manner.
Inspired by unsupervised image captioning setup, we merely require random text corpora, unlabeled video collections, and an off-the-shelf object detector to train a model.
arXiv Detail & Related papers (2021-08-29T13:21:50Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.