V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
- URL: http://arxiv.org/abs/2511.16668v1
- Date: Thu, 20 Nov 2025 18:59:42 GMT
- Title: V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
- Authors: Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You
- Abstract summary: V-ReasonBench is a benchmark designed to assess video reasoning across four key dimensions. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning.
- Score: 52.97290143922252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
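The abstract stresses that V-ReasonBench tasks are answer-verifiable: each generated video can be scored against a known ground-truth answer and results aggregated per reasoning dimension. A minimal sketch of such dimension-wise scoring (the function, field names, and the assumption that answers have already been extracted from videos are all hypothetical; the paper's actual protocol may differ):

```python
from collections import defaultdict

def score_by_dimension(results):
    """Aggregate exact-match accuracy per reasoning dimension.

    `results` is a list of dicts with keys:
      - "dimension": one of the benchmark's four axes
      - "predicted": the answer extracted from the generated video
      - "expected":  the ground-truth answer for the task
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["dimension"]] += 1
        if r["predicted"] == r["expected"]:
            correct[r["dimension"]] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

# Toy example covering two of the four V-ReasonBench dimensions.
results = [
    {"dimension": "structured", "predicted": "8", "expected": "8"},
    {"dimension": "structured", "predicted": "5", "expected": "7"},
    {"dimension": "spatial", "predicted": "left", "expected": "left"},
]
print(score_by_dimension(results))  # {'structured': 0.5, 'spatial': 1.0}
```

Exact-match scoring like this is what makes the tasks reproducible and unambiguous; answer extraction from the generated frames is the harder, model-specific step that this sketch deliberately assumes away.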
Related papers
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z)
- RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z)
- A Mechanistic View on Video Generation as World Models: State and Dynamics [43.951972667861575]
This work proposes a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.
arXiv Detail & Related papers (2026-01-22T19:00:18Z)
- RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence [24.51106324851909]
We introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question. Experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric.
arXiv Detail & Related papers (2025-12-02T10:29:51Z)
- Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks [42.11140720884257]
Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity. We introduce VR-Bench -- a benchmark designed to systematically evaluate video models' reasoning capabilities.
arXiv Detail & Related papers (2025-11-19T03:18:29Z)
- TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models [42.763907973320464]
TiViBench is a hierarchical benchmark designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. We introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models.
arXiv Detail & Related papers (2025-11-17T18:52:44Z)
- Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners. We focus on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
arXiv Detail & Related papers (2025-10-30T17:59:55Z)
- Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models [14.187604603759784]
We present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of text-to-video systems. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline. PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
arXiv Detail & Related papers (2025-07-21T17:30:46Z)
- VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks [41.90092896728809]
We present VidBridge-R1, the first versatile video reasoning model that effectively bridges the "Reason-Then-Respond" paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model.
arXiv Detail & Related papers (2025-06-10T03:57:53Z)
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness [74.17234924159108]
We introduce VBench-2.0, a benchmark designed to evaluate video generative models for intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. We conduct extensive human annotations to ensure evaluation alignment with human judgment.
arXiv Detail & Related papers (2025-03-27T17:57:01Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.