MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos
- URL: http://arxiv.org/abs/2502.12558v1
- Date: Tue, 18 Feb 2025 05:50:23 GMT
- Title: MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos
- Authors: Huaying Yuan, Jian Ni, Yueze Wang, Junjie Zhou, Zhengyang Liang, Zheng Liu, Zhao Cao, Zhicheng Dou, Ji-Rong Wen
- Abstract summary: We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks.
It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval.
It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios.
We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
- Score: 62.01402470874109
- Abstract: Retrieval augmented generation (RAG) holds great promise in addressing challenges associated with long video understanding. These methods retrieve useful moments from long videos for a given task, thereby enabling multimodal large language models (MLLMs) to generate high-quality answers in a cost-effective way. In this work, we present MomentSeeker, a comprehensive benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval (LVMR) tasks. MomentSeeker offers three key advantages. First, it incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval. Second, it covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios (e.g., sports, movies, cartoons, and egocentric videos), making it a comprehensive tool for assessing retrieval models' general LVMR performance. Third, the evaluation tasks are carefully curated through human annotation, ensuring the reliability of assessment. We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark. We perform extensive experiments with various popular multimodal retrievers on our benchmark, whose results highlight the challenges of LVMR and the limitations of existing methods. The resources we created will be shared with the community to advance future research in this field.
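To make the LVMR setup concrete, below is a minimal sketch of how an embedding-based retriever might be scored on a single moment retrieval query: the long video is chunked into clips, clips are ranked by similarity to the query embedding, and a hit is counted when a top-ranked clip sufficiently overlaps the annotated ground-truth moment. The CLIP-style encoder interface, the chunking scheme, and the Recall@k/IoU thresholds are illustrative assumptions, not the paper's actual MLLM-based retriever or official evaluation protocol.

```python
# Illustrative sketch only: a generic embedding-based moment retriever and a
# Recall@k scorer. The encoder, chunking, and thresholds are assumptions and
# do not reproduce MomentSeeker's official setup.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Clip:
    start: float           # clip start time in seconds
    end: float             # clip end time in seconds
    embedding: np.ndarray  # L2-normalized clip embedding


def rank_clips(query_emb: np.ndarray, clips: List[Clip]) -> List[Clip]:
    """Rank candidate clips by cosine similarity to a normalized query embedding."""
    scores = np.array([float(query_emb @ c.embedding) for c in clips])
    order = np.argsort(-scores)  # descending similarity
    return [clips[i] for i in order]


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal intersection-over-union between two [start, end] intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked: List[Clip], gt_moment: Tuple[float, float],
                k: int = 1, iou_thresh: float = 0.5) -> float:
    """Return 1.0 if any top-k clip overlaps the ground-truth moment at the IoU threshold."""
    hit = any(temporal_iou((c.start, c.end), gt_moment) >= iou_thresh
              for c in ranked[:k])
    return 1.0 if hit else 0.0
```

Averaging such a per-query score over all annotated queries gives a benchmark-level number; image- or video-conditioned queries would simply swap in the corresponding query encoder.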
Related papers
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding.
However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details.
We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
- SCBench: A Sports Commentary Benchmark for Video LLMs [19.13963551534595]
We develop a benchmark for sports video commentary generation for Video Large Language Models (Video LLMs).
SCBench is a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method.
Our results show that InternVL-Chat-2 achieves the best performance with a score of 5.44, surpassing the second-best by 1.04.
arXiv Detail & Related papers (2024-12-23T15:13:56Z)
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors [24.858928681280634]
We propose the MVMR (Massive Videos Moment Retrieval for Faithfulness Evaluation) task.
It aims to retrieve video moments within a massive video set, including multiple distractors, to evaluate the faithfulness of VMR models.
For this task, we suggest an automated massive video pool construction framework to categorize negative (distractors) and positive (false-negative) video sets.
arXiv Detail & Related papers (2023-08-15T17:38:55Z)
- Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.