CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2510.09266v1
- Date: Fri, 10 Oct 2025 11:05:37 GMT
- Title: CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation
- Authors: Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, Yidan Zhang, Jiang Zhong, Peijin Wang, Yingchao Feng,
- Abstract summary: Multimodal Retrieval-Augmented Generation (MRAG) enables Large Language Models (MLLMs) to generate responses with external multimodal evidence.<n>Existing benchmarks remain limited in modality coverage and format diversity.<n>We introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos.
- Score: 29.58444236508143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs
Related papers
- Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval [27.493644447594367]
MCMR (Multi-Conditional Multimodal Retrieval) is a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries.<n>It spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture.<n>We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities.
arXiv Detail & Related papers (2026-03-01T12:53:47Z) - MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs [61.70050081221131]
MVU-Eval is the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs.<n>Our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos.<n>These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics.
arXiv Detail & Related papers (2025-11-10T16:02:33Z) - CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning [11.478276629279526]
We present CVBench, the first comprehensive benchmark designed to assess cross-video relational reasoning rigorously.<n>CVBench comprises 1,000 question-answer pairs spanning three tiers: cross-video object association, cross-video event association, and cross-video complex reasoning.<n>Built from five domain-diverse video clusters, the benchmark challenges models to synthesise information across dynamic visual contexts.
arXiv Detail & Related papers (2025-08-27T03:29:35Z) - AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier.<n>Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs.<n> Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z) - MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer.<n> AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding task.<n>We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in QA task on our proposed AVHaystack
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering [42.468210353582755]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents.<n>Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches.<n>We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models.<n>The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking.<n>To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT)
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation [19.745059794932807]
We introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task.<n>We aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus.<n>To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics.
arXiv Detail & Related papers (2025-02-06T16:07:24Z) - InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [80.93387166769679]
We present IXC-2.5-Reward, a simple yet effective multi-modal reward model that aligns Large Vision Language Models with human preferences.<n>IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks.
arXiv Detail & Related papers (2025-01-21T18:47:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.