When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions
- URL: http://arxiv.org/abs/2510.17218v1
- Date: Mon, 20 Oct 2025 07:01:16 GMT
- Title: When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions
- Authors: Zhuo Cao, Heming Du, Bingqing Zhang, Xin Yu, Xue Li, Sen Wang
- Abstract summary: Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). This makes the existing datasets and methods insufficient for video temporal grounding. We introduce a high-quality dataset called the QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR).
- Score: 20.739538870657913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called the QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this filtering pipeline, low-confidence proposals are pruned and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it improves over the prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
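The abstract names the stages of the Multi-moment Post-verification module but includes no code here. As a rough illustration only, the sketch below shows one way such a stage could be wired together: proposed boundaries are accepted only within a fixed shift budget (constrained temporal adjustment), a verifier re-scores each refined candidate, and low-confidence proposals are pruned. Every name in it (`Moment`, `constrained_adjust`, `post_verify`, the `regress` and `verify` callables, the default thresholds) is a hypothetical stand-in, not FlashMMR's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Moment:
    start: float   # seconds
    end: float     # seconds
    score: float   # confidence of the proposal

def constrained_adjust(m: Moment, new_start: float, new_end: float,
                       max_shift: float, duration: float) -> Moment:
    """Constrained temporal adjustment: accept proposed boundaries only
    within +/- max_shift seconds of the originals, clamped to the video."""
    start = min(max(new_start, m.start - max_shift), m.start + max_shift)
    end = min(max(new_end, m.end - max_shift), m.end + max_shift)
    start = max(0.0, min(start, duration))
    end = min(duration, max(end, start))
    return Moment(start, end, m.score)

def post_verify(proposals: List[Moment],
                regress: Callable[[Moment], tuple],   # -> (new_start, new_end)
                verify: Callable[[Moment], float],    # -> confidence in [0, 1]
                duration: float,
                max_shift: float = 2.0,
                keep_threshold: float = 0.5) -> List[Moment]:
    """Refine boundaries, re-score each candidate, prune low-confidence ones."""
    refined = [constrained_adjust(m, *regress(m), max_shift, duration)
               for m in proposals]
    rescored = [Moment(m.start, m.end, verify(m)) for m in refined]
    kept = [m for m in rescored if m.score >= keep_threshold]
    return sorted(kept, key=lambda m: m.score, reverse=True)
```

In FlashMMR itself, `regress` and `verify` would presumably correspond to learned modules; here they are plain callables so the control flow is runnable in isolation.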
Related papers
- Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs [21.346216484639225]
Video-MSR is the first benchmark designed to evaluate Multi-hop Spatial Reasoning in dynamic video scenarios. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research.
arXiv Detail & Related papers (2026-01-14T12:24:47Z)
- MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3,100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores on the QA task on our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
- Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness [61.87055159919641]
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. We introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM).
arXiv Detail & Related papers (2025-03-24T08:46:52Z)
- Composed Multi-modal Retrieval: A Survey of Approaches and Applications [81.54640206021757]
Composed Multi-modal Retrieval (CMR) emerges as a pivotal next-generation technology. CMR enables users to query images or videos by integrating a reference visual input with textual modifications. This paper provides a comprehensive survey of CMR, covering its fundamental challenges, technical advancements, and applications.
arXiv Detail & Related papers (2025-03-03T09:18:43Z)
- TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries [46.492091661862034]
We propose the task of Ranked Video Moment Retrieval (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language.
We develop the TVR-Ranking dataset, based on the raw videos and existing moment annotations provided in the TVR dataset.
Our experiments show that the new RVMR task brings new challenges to existing models and we believe this new dataset contributes to the research on multi-modality search.
arXiv Detail & Related papers (2024-07-09T06:57:30Z)
- Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering [19.351516992903697]
We propose Mixture of Rationales (MoR), a novel multi-modal reasoning method that mixes multiple rationales for zero-shot visual question answering.
MoR achieves a 12.43% accuracy improvement on NLVR2, and a 2.45% accuracy improvement on OKVQA-S.
arXiv Detail & Related papers (2024-06-03T15:04:47Z)
- Faster Video Moment Retrieval with Point-Level Supervision [70.51822333023145]
Video Moment Retrieval (VMR) aims at retrieving the most relevant events from an untrimmed video with natural language queries.
Existing VMR methods suffer from two defects: massive expensive temporal annotations and complicated cross-modal interaction modules.
We propose a novel method termed Cheaper and Faster Moment Retrieval (CFMR).
arXiv Detail & Related papers (2023-05-23T12:53:50Z)
- AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval [47.665259947270336]
We propose AxIoU, an alternative measure for evaluating Video Moment Retrieval (VMR).
We show that AxIoU satisfies two important axioms for VMR evaluation.
We also empirically examine how AxIoU agrees with R@$K,\theta$, as well as its stability with respect to changes in the test data and human-annotated temporal boundaries (see the temporal IoU sketch after this list).
arXiv Detail & Related papers (2022-03-30T05:19:36Z)
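Several entries above (TVR-Ranking, CFMR, AxIoU) lean on the same primitive: temporal IoU between a predicted moment and a ground-truth span. As a minimal, self-contained sketch, the snippet below implements temporal IoU, the R@$K,\theta$ recall mentioned in the AxIoU summary, and a threshold-free average-of-best-IoU aggregate in the spirit of AxIoU. The exact AxIoU definition lives in the paper itself, so `mean_best_iou_at_k` should be read as an assumption about its flavor, not the published formula.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds

def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two temporal spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(preds: List[List[Span]], gts: List[Span],
                k: int, theta: float) -> float:
    """R@K,theta: fraction of queries whose top-K predictions contain at
    least one span with IoU >= theta against the ground-truth span."""
    hits = sum(
        any(temporal_iou(p, gt) >= theta for p in ranked[:k])
        for ranked, gt in zip(preds, gts)
    )
    return hits / len(gts)

def mean_best_iou_at_k(preds: List[List[Span]], gts: List[Span],
                       k: int) -> float:
    """Threshold-free aggregate (assumption: the published AxIoU may
    differ): average, over queries, of the best IoU within the top-K."""
    return sum(
        max((temporal_iou(p, gt) for p in ranked[:k]), default=0.0)
        for ranked, gt in zip(preds, gts)
    ) / len(gts)
```

Averaging the best IoU instead of thresholding it keeps the measure smooth under small perturbations of annotated boundaries, which is the stability property the AxIoU summary highlights.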