Localizing Moments in Long Video Via Multimodal Guidance
- URL: http://arxiv.org/abs/2302.13372v2
- Date: Sun, 15 Oct 2023 13:48:59 GMT
- Title: Localizing Moments in Long Video Via Multimodal Guidance
- Authors: Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos-Arroyo, Fabian
Caba Heilbron and Bernard Ghanem
- Abstract summary: We propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows.
Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ)
- Score: 51.72829274071017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent introduction of the large-scale, long-form MAD and Ego4D datasets
has enabled researchers to investigate the performance of current
state-of-the-art methods for video grounding in the long-form setup, with
interesting findings: current grounding methods alone fail at tackling this
challenging task and setup due to their inability to process long video
sequences. In this paper, we propose a method for improving the performance of
natural language grounding in long videos by identifying and pruning out
non-describable windows. We design a guided grounding framework consisting of a
Guidance Model and a base grounding model. The Guidance Model emphasizes
describable windows, while the base grounding model analyzes short temporal
windows to determine which segments accurately match a given language query. We
offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent,
which balance efficiency and accuracy. Experiments demonstrate that our
proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in
Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to
reproduce our experiments are available at:
https://github.com/waybarrios/guidance-based-video-grounding.
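To make the two-stage design above concrete, the following is a minimal, hypothetical Python sketch of a guided grounding pipeline: a guidance model scores short temporal windows for "describability", low-scoring windows are pruned, and a base grounding model localizes the query only within the surviving windows. The function names, window size, stride, and keep-ratio threshold are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a guided grounding pipeline (not the authors' code).
# A guidance model scores fixed-length windows for "describability";
# low-scoring windows are pruned before the base grounding model runs.

from typing import Callable, List, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec)


def split_into_windows(video_duration: float, window_size: float, stride: float) -> List[Window]:
    """Slide a fixed-length window over the full video."""
    windows, start = [], 0.0
    while start < video_duration:
        windows.append((start, min(start + window_size, video_duration)))
        start += stride
    return windows


def guided_grounding(
    video_duration: float,
    query: str,
    guidance_score: Callable[[Window, str], float],   # query-dependent variant; a
                                                      # query-agnostic one would ignore `query`
    ground_in_window: Callable[[Window, str], Tuple[Window, float]],
    window_size: float = 30.0,
    stride: float = 15.0,
    keep_ratio: float = 0.2,
) -> Window:
    """Prune non-describable windows, then ground the query in the survivors."""
    windows = split_into_windows(video_duration, window_size, stride)

    # 1) Guidance stage: keep only the top-scoring (most "describable") windows.
    scored = sorted(windows, key=lambda w: guidance_score(w, query), reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_ratio))]

    # 2) Grounding stage: run the base grounding model only on the kept windows
    #    and return the highest-confidence moment.
    best_moment, best_conf = None, float("-inf")
    for window in kept:
        moment, conf = ground_in_window(window, query)
        if conf > best_conf:
            best_moment, best_conf = moment, conf
    return best_moment


if __name__ == "__main__":
    # Dummy scorers stand in for the real guidance and grounding models.
    dummy_guidance = lambda w, q: -abs(w[0] - 600.0)          # pretend minute 10 is describable
    dummy_grounder = lambda w, q: ((w[0] + 5.0, w[0] + 12.0), -abs(w[0] - 600.0))
    print(guided_grounding(3600.0, "the man opens the door", dummy_guidance, dummy_grounder))
```

In this sketch, the efficiency/accuracy trade-off between the two guidance designs shows up in the scoring callable: a query-agnostic guidance model scores each window once per video (cheap, cacheable across queries), whereas a query-dependent one re-scores every window for every query (costlier, potentially more accurate).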
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation [20.88042649759396]
We propose a novel diffusion-based long video generation method with a shared noise modeling mechanism across the multi-views to increase spatial consistency.
Our method can generate up to 40 frames of video without loss of consistency, which is about 5 times longer than state-of-the-art methods.
Our framework goes beyond perception and prediction tasks and, for the first time, boosts the planning performance of the end-to-end autonomous driving model by a margin of 25%.
arXiv Detail & Related papers (2024-06-03T14:13:13Z) - PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z) - Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z) - Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV)
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z) - PGT: A Progressive Method for Training Models on Long Videos [45.935259079953255]
The mainstream method is to split a raw video into clips, which leads to incomplete temporal information flow.
Inspired by natural language processing techniques dealing with long sentences, we propose to treat videos as serial fragments satisfying Markov property.
We empirically demonstrate that it yields significant performance improvements on different models and datasets.
arXiv Detail & Related papers (2021-03-21T06:15:20Z)