CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
- URL: http://arxiv.org/abs/2508.19542v2
- Date: Thu, 28 Aug 2025 14:22:38 GMT
- Title: CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
- Authors: Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, Xiongfei Yao, Shuaiwei Jiao,
- Abstract summary: We present CVBench, the first comprehensive benchmark designed to rigorously assess cross-video relational reasoning. CVBench comprises 1,000 question-answer pairs spanning three tiers: cross-video object association, cross-video event association, and cross-video complex reasoning. Built from five domain-diverse video clusters, the benchmark challenges models to synthesise information across dynamic visual contexts.
- Score: 11.478276629279526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their ability to reason across multiple videos remains critically underexplored. Yet this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first comprehensive benchmark designed to rigorously assess cross-video relational reasoning. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to synthesise information across dynamic visual contexts. We extensively evaluate 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, and Qwen2.5-VL) under zero-shot and chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 60% accuracy on causal reasoning tasks, compared with 91% for humans. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for diagnosing and advancing multi-video reasoning, offering architectural insights for next-generation MLLMs. The data and evaluation code are available at https://github.com/Hokhim2/CVBench.
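For concreteness, here is a minimal sketch of what a zero-shot evaluation loop over CVBench-style question-answer pairs might look like. The record fields (`video_paths`, `question`, `answer`, `tier`) and the `query_mllm` helper are hypothetical illustrations, not the schema or API of the released evaluation code linked above.

```python
import json
from collections import defaultdict

def query_mllm(video_paths, question):
    """Hypothetical stand-in for an MLLM call (e.g., GPT-4o with sampled
    video frames attached); replace with a real model API."""
    raise NotImplementedError

def evaluate(qa_file):
    """Compute per-tier accuracy over CVBench-style QA records, where each
    record is assumed to look like:
      {"video_paths": [...], "question": "...",
       "answer": "B", "tier": "cross-video event association"}
    """
    with open(qa_file) as f:
        records = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        pred = query_mllm(rec["video_paths"], rec["question"])
        total[rec["tier"]] += 1
        if pred.strip().upper() == rec["answer"].strip().upper():
            correct[rec["tier"]] += 1

    # Accuracy per tier, e.g., {"cross-video complex reasoning": 0.60, ...}
    return {tier: correct[tier] / total[tier] for tier in total}
```

A chain-of-thought variant would differ only in the prompt passed to the model, with the final answer parsed out of the generated reasoning trace before the exact-match comparison.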
Related papers
- CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models [35.31011102734943]
Cross-Video Reasoning (CVR) requires simultaneous understanding of multiple videos to aggregate and compare information across groups of videos. Most existing video understanding benchmarks focus on single-video analysis, failing to assess the ability of multimodal large language models (MLLMs) to reason simultaneously over multiple videos. Recent benchmarks evaluate MLLMs' capabilities on multi-view videos that capture different perspectives of the same scene. We introduce CrossVid, the first benchmark designed to comprehensively evaluate MLLMs' spatial-temporal reasoning ability in cross-video contexts.
arXiv Detail & Related papers (2025-11-15T15:41:38Z)
- VideoSSR: Video Self-Supervised Reinforcement Learning [62.25888935329454]
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). This work investigates a pivotal question: can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data?
arXiv Detail & Related papers (2025-11-09T08:36:40Z)
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.32948112203228]
Video understanding represents the most challenging frontier in computer vision. The recent emergence of Video Large Multimodal Models (Video-LMMs) has demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z)
- MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3,100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores, respectively, on the QA task of our proposed AVHaystacks benchmark.
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
- VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? [18.9270920369958]
Long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. Recent efforts have proposed benchmarks aimed at video reasoning, but their tasks are often knowledge-driven and do not rely heavily on visual content. We introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning.
arXiv Detail & Related papers (2025-05-29T11:33:43Z)
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos, curated specifically to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z)
- RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video [19.373906873461703]
RTV-Bench is a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs.
arXiv Detail & Related papers (2025-05-04T10:55:21Z)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
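The "semantic-aware method to reduce redundancy" is described only at a high level here; the sketch below shows one common realization of the idea, keeping a frame only when its embedding is sufficiently dissimilar from the last kept frame. The `embed_frame` helper and the 0.95 threshold are assumptions for illustration, not VideoEspresso's actual pipeline.

```python
import numpy as np

def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Hypothetical semantic encoder (e.g., a CLIP-style image embedding);
    replace with a real model."""
    raise NotImplementedError

def deduplicate_frames(frames, threshold=0.95):
    """Drop frames that are near-duplicates of the previously kept frame,
    measured by cosine similarity of their semantic embeddings."""
    kept, last_emb = [], None
    for frame in frames:
        emb = embed_frame(frame)
        emb = emb / np.linalg.norm(emb)  # unit-normalize for cosine similarity
        if last_emb is None or float(emb @ last_emb) < threshold:
            kept.append(frame)
            last_emb = emb
    return kept
```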
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
- Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations.
VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)