V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
- URL: http://arxiv.org/abs/2511.16668v1
- Date: Thu, 20 Nov 2025 18:59:42 GMT
- Title: V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
- Authors: Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You
- Abstract summary: V-ReasonBench is a benchmark designed to assess video reasoning across four key dimensions. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning.
- Score: 52.97290143922252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
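The abstract stresses that V-ReasonBench tasks are answer-verifiable: each generated video can be scored against a known ground-truth answer and results aggregated per reasoning dimension. A minimal sketch of such dimension-wise scoring (the function, field names, and the assumption that answers have already been extracted from videos are all hypothetical; the paper's actual protocol may differ):

```python
from collections import defaultdict

def score_by_dimension(results):
    """Aggregate exact-match accuracy per reasoning dimension.

    `results` is a list of dicts with keys:
      - "dimension": one of the benchmark's four axes
      - "predicted": the answer extracted from the generated video
      - "expected":  the ground-truth answer for the task
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["dimension"]] += 1
        if r["predicted"] == r["expected"]:
            correct[r["dimension"]] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

# Toy example covering two of the four V-ReasonBench dimensions.
results = [
    {"dimension": "structured", "predicted": "8", "expected": "8"},
    {"dimension": "structured", "predicted": "5", "expected": "7"},
    {"dimension": "spatial", "predicted": "left", "expected": "left"},
]
print(score_by_dimension(results))  # {'structured': 0.5, 'spatial': 1.0}
```

Exact-match scoring like this is what makes the tasks reproducible and unambiguous; answer extraction from the generated frames is the harder, model-specific step that this sketch deliberately assumes away.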
Related papers
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z)
- RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z)
- A Mechanistic View on Video Generation as World Models: State and Dynamics [43.951972667861575]
This work proposes a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.
arXiv Detail & Related papers (2026-01-22T19:00:18Z)
- RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence [24.51106324851909]
We introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question. Experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric.
arXiv Detail & Related papers (2025-12-02T10:29:51Z)
- Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks [42.11140720884257]
Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity. We introduce VR-Bench -- a benchmark designed to systematically evaluate video models' reasoning capabilities.
arXiv Detail & Related papers (2025-11-19T03:18:29Z)
- TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models [42.763907973320464]
TiViBench is a hierarchical benchmark designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. We introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models.
arXiv Detail & Related papers (2025-11-17T18:52:44Z)
- Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners. We focus on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
arXiv Detail & Related papers (2025-10-30T17:59:55Z)
- Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models [14.187604603759784]
We present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of text-to-video systems. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline. PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
arXiv Detail & Related papers (2025-07-21T17:30:46Z)
- VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks [41.90092896728809]
We present VidBridge-R1, the first versatile video reasoning model that effectively bridges the "Reason-Then-Respond" paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model.
arXiv Detail & Related papers (2025-06-10T03:57:53Z)
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness [74.17234924159108]
We introduce VBench-2.0, a benchmark designed to evaluate video generative models for intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. We conduct extensive human annotations to ensure evaluation alignment with human judgment.
arXiv Detail & Related papers (2025-03-27T17:57:01Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.