RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
- URL: http://arxiv.org/abs/2512.02622v1
- Date: Tue, 02 Dec 2025 10:29:51 GMT
- Authors: Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu
- Abstract summary: We introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. For each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign a score to each question. Experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric.
- Score: 24.51106324851909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. Existing benchmarks for video generation models focus primarily on factors related to visual perception and understanding, such as visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms, text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories, with 622 high-quality annotated instances. To evaluate each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign a score to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insights obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
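The abstract describes a checklist-based protocol: each generated video is judged on a set of questions grouped under four metrics, with an LMM (GPT-o3) scoring each question. A minimal sketch of how such per-question scores might be aggregated into metric-level percentages is shown below; the metric names, the 0-1 per-question score scale, and the averaging scheme are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def aggregate_checklist(answers):
    """Aggregate per-question judge scores into metric-level percentages.

    answers: list of (metric_name, score) pairs, each score in [0, 1]
    as assigned by the LMM judge for one checklist question.
    Returns a dict mapping metric name -> mean score as a percentage.
    """
    per_metric = defaultdict(list)
    for metric, score in answers:
        per_metric[metric].append(score)
    return {m: 100.0 * sum(s) / len(s) for m, s in per_metric.items()}

# Illustrative judge outputs for one generated video (hypothetical
# metric names and values, for demonstration only).
judge_answers = [
    ("rule_coherence", 0.5),
    ("rule_coherence", 0.5),
    ("instruction_adherence", 1.0),
    ("visual_quality", 1.0),
    ("temporal_consistency", 0.0),
]
scores = aggregate_checklist(judge_answers)
# scores["rule_coherence"] is 50.0 under this toy input
```

A benchmark-level score such as the reported 48.87% rule coherence would then be the mean of such per-video metric scores across all instances.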
Related papers
- UniVBench: Towards Unified Evaluation for Video Foundation Models [29.73247324829126]
Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework. We introduce UniVBench, a benchmark for evaluating video foundation models across four core abilities. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse, and multi-shot videos.
arXiv Detail & Related papers (2026-02-25T12:08:53Z) - RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z) - V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models [52.97290143922252]
V-ReasonBench is a benchmark designed to assess video reasoning across four key dimensions. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning.
arXiv Detail & Related papers (2025-11-20T18:59:42Z) - Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks [42.11140720884257]
Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Compared with discrete text corpora, video grounds reasoning in explicit spatial layouts and temporal continuity. We introduce VR-Bench, a benchmark designed to systematically evaluate video models' reasoning capabilities.
arXiv Detail & Related papers (2025-11-19T03:18:29Z) - TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models [42.763907973320464]
TiViBench is a hierarchical benchmark designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. We introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models.
arXiv Detail & Related papers (2025-11-17T18:52:44Z) - VideoScore2: Think before You Score in Generative Video Evaluation [69.43069741467603]
VideoScore2 is a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency. Our model is trained on VideoFeedback2, a large-scale dataset containing 27,168 human-annotated videos.
arXiv Detail & Related papers (2025-09-26T18:09:03Z) - VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning [21.35520258725298]
VQ-Insight is a novel reasoning-style framework for AIGC video quality assessment. It combines image quality warm-up, general task-specific temporal learning, and joint optimization with the video generation model. It consistently outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring.
arXiv Detail & Related papers (2025-06-23T12:20:14Z) - VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness [74.17234924159108]
We introduce VBench-2.0, a benchmark designed to evaluate video generative models for intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. We conduct extensive human annotations to ensure evaluation alignment with human judgment.
arXiv Detail & Related papers (2025-03-27T17:57:01Z) - Enhance-A-Video: Better Generated Video for Free [57.620595159855064]
We introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos. Our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning.
arXiv Detail & Related papers (2025-02-11T12:22:35Z) - Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models [81.84810348214113]
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial.
This paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs.
arXiv Detail & Related papers (2023-11-27T18:59:58Z) - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and exhibit multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)