Localizing Events in Videos with Multimodal Queries
- URL: http://arxiv.org/abs/2406.10079v3
- Date: Thu, 21 Nov 2024 17:58:55 GMT
- Title: Localizing Events in Videos with Multimodal Queries
- Authors: Gengyuan Zhang, Mang Ling Ada Fok, Jialu Ma, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu,
- Abstract summary: Localizing events in videos based on semantic queries is a pivotal task in video understanding.
We introduce ICQ, a new benchmark designed for localizing events in videos with multimodal queries.
We propose 3 Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning on pseudo-MQs strategy.
- Score: 61.20556229245365
- License:
- Abstract: Localizing events in videos based on semantic queries is a pivotal task in video understanding, with the growing significance of user-oriented applications like video search. Yet, current research predominantly relies on natural language queries (NLQs), overlooking the potential of using multimodal queries (MQs) that integrate images to more flexibly represent semantic queries -- especially when it is difficult to express non-verbal or unfamiliar concepts in words. To bridge this gap, we introduce ICQ, a new benchmark designed for localizing events in videos with MQs, alongside an evaluation dataset ICQ-Highlight. To accommodate and evaluate existing video localization models for this new task, we propose 3 Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning on pseudo-MQs strategy. ICQ systematically benchmarks 12 state-of-the-art backbone models, spanning from specialized video localization models to Video LLMs, across diverse application domains. Our experiments highlight the high potential of MQs in real-world applications. We believe this benchmark is a first step toward advancing MQs in video event localization.
Related papers
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models [21.892975397847316]
We present an interactive Multi-modal Query Answering (MQA) system, empowered by our newly developed multi-modal retrieval framework and navigation graph index.
One notable aspect of MQA is its utilization of contrastive learning to assess the significance of different modalities.
The system achieves efficient retrieval through our advanced navigation graph index, refined using computational pruning techniques.
arXiv Detail & Related papers (2024-07-05T02:01:49Z) - The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - Self-Chained Image-Language Model for Video Localization and Question
Answering [66.86740990630433]
We propose Self-Chained Video-Answering (SeViLA) framework to tackle both temporal localization and QA on videos.
SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2.
arXiv Detail & Related papers (2023-05-11T17:23:00Z) - CONQUER: Contextual Query-aware Ranking for Video Corpus Moment
Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z) - DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video
Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video
Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-trimmed frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.