Zero-shot Natural Language Video Localization
- URL: http://arxiv.org/abs/2110.00428v1
- Date: Sun, 29 Aug 2021 13:21:50 GMT
- Title: Zero-shot Natural Language Video Localization
- Authors: Jinwoo Nam, Daechul Ahn, Dongyeop Kang, Seong Jong Ha, and Jonghyun Choi
- Abstract summary: We make a first attempt to train a natural language video localization model in a zero-shot manner.
Inspired by the unsupervised image captioning setup, we merely require random text corpora, unlabeled video collections, and an off-the-shelf object detector to train a model.
- Score: 11.522385805128001
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding videos to localize moments with natural language often requires a large number of expensive annotated video regions paired with language queries. To eliminate the annotation costs, we make a first attempt to train a natural language video localization (NLVL) model in a zero-shot manner. Inspired by the unsupervised image captioning setup, we merely require random text corpora, unlabeled video collections, and an off-the-shelf object detector to train a model. With this unpaired data, we propose to generate pseudo-supervision consisting of candidate temporal regions and corresponding query sentences, and we develop a simple NLVL model trained with this pseudo-supervision. Our empirical validations show that the proposed pseudo-supervised method outperforms several baseline approaches and a number of methods using stronger supervision on Charades-STA and ActivityNet-Captions.
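As a rough sketch of the pseudo-supervision idea above (hypothetical code, not the authors' implementation), the snippet below pairs a randomly sampled temporal region of an unlabeled video with a pseudo query composed from off-the-shelf object detections and a random text corpus; sample_temporal_region, detect_objects, and compose_pseudo_query are placeholder names.

    import random
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class PseudoSample:
        video_id: str
        region: Tuple[float, float]  # candidate temporal region (start_sec, end_sec)
        query: str                   # pseudo query sentence

    def sample_temporal_region(duration: float, min_len: float = 2.0) -> Tuple[float, float]:
        # Placeholder: draw a random candidate temporal region inside the video.
        start = random.uniform(0.0, max(duration - min_len, 0.0))
        end = min(duration, start + random.uniform(min_len, max(duration - start, min_len)))
        return (start, end)

    def detect_objects(video_id: str, region: Tuple[float, float]) -> List[str]:
        # Placeholder for an off-the-shelf object detector run on frames in the region.
        return ["person", "cup", "table"]

    def compose_pseudo_query(nouns: List[str], corpus_templates: List[str]) -> str:
        # Placeholder: inject a detected noun into a sentence template from a text corpus.
        return random.choice(corpus_templates).format(obj=random.choice(nouns))

    def build_pseudo_supervision(video_id: str, duration: float,
                                 corpus_templates: List[str]) -> PseudoSample:
        region = sample_temporal_region(duration)
        nouns = detect_objects(video_id, region)
        return PseudoSample(video_id, region, compose_pseudo_query(nouns, corpus_templates))

    if __name__ == "__main__":
        templates = ["a person interacts with a {obj}", "someone picks up the {obj}"]
        print(build_pseudo_supervision("vid_0001", duration=30.0, corpus_templates=templates))

An NLVL model would then be trained on such (video, pseudo query, region) triplets in place of human annotations.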
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods tend to be domain-specific, which makes them costly to deploy in practice.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
- Unsupervised Open-Vocabulary Object Localization in Videos [118.32792460772332]
We show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization.
We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots.
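As a minimal, hypothetical sketch of the second step (assigning text to the obtained slots), the snippet below matches precomputed slot embeddings to label-name embeddings by cosine similarity; the random vectors stand in for real slot-attention and text-encoder outputs and this is not the paper's implementation.

    import numpy as np

    def assign_text_to_slots(slot_embs, text_embs, labels):
        # Give each object slot the label whose text embedding is most similar (cosine).
        s = slot_embs / np.linalg.norm(slot_embs, axis=1, keepdims=True)
        t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
        sim = s @ t.T                              # (num_slots, num_labels)
        return [labels[i] for i in sim.argmax(axis=1)]

    rng = np.random.default_rng(0)
    slots = rng.normal(size=(4, 512))              # stand-in for slot-attention outputs
    texts = rng.normal(size=(3, 512))              # stand-in for label-name text embeddings
    print(assign_text_to_slots(slots, texts, ["person", "dog", "ball"]))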
arXiv Detail & Related papers (2023-09-18T15:20:13Z)
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
- Self-Chained Image-Language Model for Video Localization and Question Answering [66.86740990630433]
We propose the Self-Chained Video-Answering (SeViLA) framework to tackle both temporal localization and question answering (QA) on videos.
The SeViLA framework consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2.
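A minimal sketch of the general localize-then-answer pattern described here (toy code, not SeViLA's BLIP-2 modules): frames are represented as caption strings, and the scorer and answerer are placeholders.

    from typing import Callable, List, Sequence

    def localize_then_answer(frames: Sequence[str], question: str,
                             frame_scorer: Callable[[str, str], float],
                             answerer: Callable[[List[str], str], str],
                             top_k: int = 4) -> str:
        # Stage 1 (Localizer): score every frame against the question and keep the top-k.
        keyframes = sorted(frames, key=lambda f: frame_scorer(f, question), reverse=True)[:top_k]
        # Stage 2 (Answerer): answer the question from the selected keyframes only.
        return answerer(keyframes, question)

    # Toy stand-ins for the two modules; a real system would chain fine-tuned VLM heads.
    dummy_scorer = lambda frame, q: float(len(set(frame.split()) & set(q.split())))
    dummy_answerer = lambda keyframes, q: f"answer derived from {len(keyframes)} keyframes"

    frames = ["a person opens a door", "a dog runs", "a person pours coffee", "an empty room"]
    print(localize_then_answer(frames, "what does the person pour", dummy_scorer, dummy_answerer, top_k=2))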
arXiv Detail & Related papers (2023-05-11T17:23:00Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training [12.760340242744313]
Fine-grained supervision based on object annotations has been widely used for vision and language pre-training.
In real-world application scenarios, aligned multi-modal data is usually in the image-caption format, which only provides coarse-grained supervision.
arXiv Detail & Related papers (2023-03-09T15:01:12Z)
- Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.