MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval
- URL: http://arxiv.org/abs/2502.12558v4
- Date: Tue, 20 May 2025 03:30:44 GMT
- Title: MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval
- Authors: Huaying Yuan, Jian Ni, Zheng Liu, Yueze Wang, Junjie Zhou, Zhengyang Liang, Bo Zhao, Zhao Cao, Zhicheng Dou, Ji-Rong Wen
- Abstract summary: We propose MomentSeeker, a novel benchmark for long-video moment retrieval (LMVR). MomentSeeker is created based on long and diverse videos, averaging over 1200 seconds in duration. It covers a variety of real-world scenarios at three levels (global-level, event-level, and object-level), spanning common tasks such as action recognition, object localization, and causal reasoning.
- Score: 61.414236415351446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LMVR), distinguished by the following features. First, it is created based on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios at three levels (global-level, event-level, and object-level), covering common tasks such as action recognition, object localization, and causal reasoning. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker (https://yhy-2000.github.io/MomentSeeker/) to facilitate future research in this area.
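As a point of reference for the retrieval-based evaluation described above, below is a minimal sketch of how moment-retrieval accuracy is commonly scored, using temporal IoU between predicted and ground-truth segments and a recall@K criterion. This is an illustrative assumption rather than the benchmark's official evaluation code; the interval format, the 0.5 IoU threshold, and the function names are hypothetical.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec) within a video; assumed format


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked_preds: List[Moment], gt: Moment,
                k: int = 1, iou_thresh: float = 0.5) -> float:
    """Return 1.0 if any of the top-k predicted moments overlaps the ground
    truth above the IoU threshold, else 0.0 (averaged over queries elsewhere)."""
    return float(any(temporal_iou(p, gt) >= iou_thresh for p in ranked_preds[:k]))


# Toy usage: one query against a single long video.
preds = [(610.0, 640.0), (95.0, 120.0)]  # candidate moments, best-ranked first
gt = (600.0, 650.0)                      # annotated ground-truth moment
print(recall_at_k(preds, gt, k=1))       # -> 1.0 (IoU = 0.6 >= 0.5)
```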
Related papers
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding [79.06209664703258]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - TextVidBench: A Benchmark for Long Video Scene Text Understanding [60.94150574231576]
We introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: it spans 9 categories (e.g., news, sports, gaming) with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. We also propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on ...
arXiv Detail & Related papers (2025-06-05T12:54:56Z) - Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation [31.48914479058998]
We introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos with an average duration exceeding 60 seconds, covering a variety of objects. Unlike previous benchmarks that rely solely on per-frame spatial evaluation, we introduce two metrics to assess temporal and spatiotemporal consistency.
arXiv Detail & Related papers (2025-05-19T04:52:31Z) - Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering [36.94781787191615]
We propose a simple yet effective approach with active moment discovering (AMDNet), which is committed to discovering video moments that are semantically consistent with their queries.
Experiments on two large-scale video datasets demonstrate the superiority and efficiency of our AMDNet.
arXiv Detail & Related papers (2025-04-15T07:00:18Z) - H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding [25.111988967973147]
Existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability.
We propose a hierarchical and holistic video understanding benchmark designed to evaluate both general video and online streaming video comprehension.
This benchmark contributes three key features: extended video duration, comprehensive assessment tasks, and enriched video data.
arXiv Detail & Related papers (2025-03-31T12:32:51Z) - FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs [8.18451834099348]
Our novel video agent, FALCONEye, combines a VLM and a Large Language Model (LLM) to search relevant information along the video, and locate the frames with the answer.
Our experiments show FALCONEye's superior performance over the state-of-the-art on FALCON-Bench, and similar or better performance on related benchmarks.
arXiv Detail & Related papers (2025-03-25T17:17:19Z) - HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [20.184894298462652]
We build HLV-1K, a large-scale benchmark of hour-long videos designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
arXiv Detail & Related papers (2025-01-03T05:32:37Z) - VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. We introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z) - SCBench: A Sports Commentary Benchmark for Video LLMs [19.13963551534595]
We develop a benchmark for sports video commentary generation for Video Large Language Models (Video LLMs). SCBench is a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method. Our results show that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04.
arXiv Detail & Related papers (2024-12-23T15:13:56Z) - The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z) - MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs). We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - MLVU: Benchmarking Multi-task Long Video Understanding [28.35597611731375]
We propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. The empirical study with 23 of the latest MLLMs reveals significant room for improvement in today's techniques.
arXiv Detail & Related papers (2024-06-06T17:09:32Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - TempCompass: Do Video LLMs Really Understand Videos? [36.28973015469766]
Existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs.
We propose the TempCompass benchmark, which introduces a diversity of high-quality temporal aspects and task formats.
arXiv Detail & Related papers (2024-03-01T12:02:19Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z) - MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors [24.858928681280634]
We propose the MVMR (Massive Videos Moment Retrieval for Faithfulness Evaluation) task.
It aims to retrieve video moments within a massive video set, including multiple distractors, to evaluate the faithfulness of VMR models.
For this task, we suggest an automated massive video pool construction framework to categorize negative (distractors) and positive (false-negative) video sets.
arXiv Detail & Related papers (2023-08-15T17:38:55Z) - Video-based Person Re-identification with Long Short-Term Representation
Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)