Related papers: MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos

MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos

URL: http://arxiv.org/abs/2502.12558v2
Date: Mon, 10 Mar 2025 05:34:20 GMT
Title: MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos
Authors: Huaying Yuan, Jian Ni, Yueze Wang, Junjie Zhou, Zhengyang Liang, Zheng Liu, Zhao Cao, Zhicheng Dou, Ji-Rong Wen,
Abstract summary: We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks.<n>It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval.<n>It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios.<n>We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
Score: 62.01402470874109
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieval augmented generation (RAG) holds great promise in addressing challenges associated with long video understanding. These methods retrieve useful moments from long videos for their presented tasks, thereby enabling multimodal large language models (MLLMs) to generate high-quality answers in a cost-effective way. In this work, we present MomentSeeker, a comprehensive benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval (LVMR) tasks. MomentSeeker offers three key advantages. First, it incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval. Second, it covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios (e.g., sports, movies, cartoons, and ego), making it a comprehensive tool for assessing retrieval models' general LVMR performance. Additionally, the evaluation tasks are carefully curated through human annotation, ensuring the reliability of assessment. We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark. We perform extensive experiments with various popular multimodal retrievers based on our benchmark, whose results highlight the challenges of LVMR and limitations for existing methods. Our created resources will be shared with community to advance future research in this field.

Related papers

HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding [79.06209664703258]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos.<n>Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios.<n>We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
TextVidBench: A Benchmark for Long Video Scene Text Understanding [60.94150574231576]
We introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes)<n>TextVidBench makes three key contributions: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding.<n>We propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on
arXiv Detail & Related papers (2025-06-05T12:54:56Z)
Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation [31.48914479058998]
We introduce Longtextbf-RVOS, a large-scale benchmark for long-term referring object segmentation.<n>Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects.<n>Unlike previous benchmarks that rely solely on the per-frame spatial evaluation, we introduce two metrics to assess the temporal andtemporal consistency.
arXiv Detail & Related papers (2025-05-19T04:52:31Z)
Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering [36.94781787191615]
We propose a simple yet effective approach with active moment discovering (AMDNet) We are committed to discovering video moments that are semantically consistent with their queries. Experiments on two large-scale video datasets demonstrate the superiority and efficiency of our AMDNet.
arXiv Detail & Related papers (2025-04-15T07:00:18Z)
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding [25.111988967973147]
Existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. We propose a hierarchical and holistic video understanding benchmark designed to evaluate both general video and online streaming video comprehension. This benchmark contributes three key features: extended video duration, comprehensive assessment tasks, andEnriched video data.
arXiv Detail & Related papers (2025-03-31T12:32:51Z)
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs [8.18451834099348]
Our novel video agent, FALCONEye, combines a VLM and a Large Language Model (LLM) to search relevant information along the video, and locate the frames with the answer. Our experiments show FALCONEye's superior performance than the state-of-the-art in FALCON-Bench, and similar or better performance in related benchmarks.
arXiv Detail & Related papers (2025-03-25T17:17:19Z)
HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [20.184894298462652]
We build a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models.<n>HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question asnwering (MCQA)<n>We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
arXiv Detail & Related papers (2025-01-03T05:32:37Z)
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding.<n>However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details.<n>We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
SCBench: A Sports Commentary Benchmark for Video LLMs [19.13963551534595]
We develop a benchmark for sports video commentary generation for Video Large Language Models (Video LLMs)<n>$textbfSCBench$ is a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method.<n>Our results found InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04.
arXiv Detail & Related papers (2024-12-23T15:13:56Z)
The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute. This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval. We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (LMLMs)<n>We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.<n>We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
MLVU: Benchmarking Multi-task Long Video Understanding [28.35597611731375]
We propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU.<n> MLVU presents the following critical values: textit1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations.<n>The empirical study with 23 latest MLLMs reveals significant room for improvement in today's technique.
arXiv Detail & Related papers (2024-06-06T17:09:32Z)
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
TempCompass: Do Video LLMs Really Understand Videos? [36.28973015469766]
Existing benchmarks fail to provide a comprehensive feedback on the temporal perception ability of Video LLMs. We propose the textbfTemp benchmark, which introduces a diversity of high-quality temporal aspects and task formats.
arXiv Detail & Related papers (2024-03-01T12:02:19Z)
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench. We first introduce a novel static-to-dynamic method to define these temporal-related tasks. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors [24.858928681280634]
We propose the MVMR (Massive Videos Moment Retrieval for Faithfulness Evaluation) task. It aims to retrieve video moments within a massive video set, including multiple distractors, to evaluate the faithfulness of VMR models. For this task, we suggest an automated massive video pool construction framework to categorize negative (distractors) and positive (false-negative) video sets.
arXiv Detail & Related papers (2023-08-15T17:38:55Z)
Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras. We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. We evaluate various baseline methods with and without large-scale VidL pre-training. The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.