MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors
- URL: http://arxiv.org/abs/2309.16701v4
- Date: Fri, 9 Aug 2024 00:53:10 GMT
- Title: MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors
- Authors: Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung,
- Abstract summary: We propose the MVMR (Massive Videos Moment Retrieval for Faithfulness Evaluation) task.
It aims to retrieve video moments within a massive video set, including multiple distractors, to evaluate the faithfulness of VMR models.
For this task, we suggest an automated massive video pool construction framework to categorize negative (distractors) and positive (false-negative) video sets.
- Score: 24.858928681280634
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the explosion of multimedia content, video moment retrieval (VMR), which aims to detect a video moment that matches a given text query from a video, has been studied intensively as a critical problem. However, the existing VMR framework evaluates video moment retrieval performance, assuming that a video is given, which may not reveal whether the models exhibit overconfidence in the falsely given video. In this paper, we propose the MVMR (Massive Videos Moment Retrieval for Faithfulness Evaluation) task that aims to retrieve video moments within a massive video set, including multiple distractors, to evaluate the faithfulness of VMR models. For this task, we suggest an automated massive video pool construction framework to categorize negative (distractors) and positive (false-negative) video sets using textual and visual semantic distance verification methods. We extend existing VMR datasets using these methods and newly construct three practical MVMR datasets. To solve the task, we further propose a strong informative sample-weighted learning method, CroCs, which employs two contrastive learning mechanisms: (1) weakly-supervised potential negative learning and (2) cross-directional hard-negative learning. Experimental results on the MVMR datasets reveal that existing VMR models are easily distracted by the misinformation (distractors), whereas our model shows significantly robust performance, demonstrating that CroCs is essential to distinguishing positive moments against distractors. Our code and datasets are publicly available: https://github.com/yny0506/Massive-Videos-Moment-Retrieval.
Related papers
- QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval [7.313447367245476]
Video Moment Retrieval (VMR) aims to retrieve relevant moments of an untrimmed video corresponding to the query.
We propose a novel model called QD-VMR, a query debiasing model with enhanced contextual understanding.
arXiv Detail & Related papers (2024-08-23T10:56:42Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (LMLMs)
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - Generative Video Diffusion for Unseen Cross-Domain Video Moment
Retrieval [58.17315970207874]
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships.
Existing methods resort to joint training on both source and target domain videos for cross-domain applications.
We explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences.
arXiv Detail & Related papers (2024-01-24T09:45:40Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval ( VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Multi-video Moment Ranking with Multimodal Clue [69.81533127815884]
State-of-the-art work for VCMR is based on two-stage method.
MINUTE outperforms the baselines on TVR and DiDeMo datasets.
arXiv Detail & Related papers (2023-01-29T18:38:13Z) - Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z) - VLANet: Video-Language Alignment Network for Weakly-Supervised Video
Moment Retrieval [21.189093631175425]
Video Moment Retrieval (VMR) is a task to localize the temporal moment in untrimmed video specified by natural language query.
This paper explores methods for performing VMR in a weakly-supervised manner (wVMR)
The experiments show that the method achieves state-of-the-art performance on Charades-STA and DiDeMo datasets.
arXiv Detail & Related papers (2020-08-24T07:54:59Z) - Temporal Context Aggregation for Video Retrieval with Contrastive
Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.