Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
- URL: http://arxiv.org/abs/2511.13853v1
- Date: Mon, 17 Nov 2025 19:11:39 GMT
- Title: Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
- Authors: Xinxin Liu, Zhaopan Xu, Kai Wang, Yong Jae Lee, Yuzhang Shang, et al.
- Abstract summary: Video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning. Existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning. We introduce Gen-ViRe, a framework grounded in cognitive science and real-world AI applications.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.
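To make the hybrid VLM-assisted protocol more concrete, here is a minimal sketch of rubric-based scoring over sampled frames. The judge interface, criteria wording, and 1-5 rating scale are illustrative assumptions, not the paper's published implementation.

```python
# Hypothetical sketch of rubric-based, VLM-assisted CoF evaluation.
# Gen-ViRe does not publish this exact interface; `Criterion`, `judge`,
# and the 1-5 scale are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

@dataclass
class Criterion:
    dimension: str   # e.g. "perceptual logic", "abstract planning"
    question: str    # detailed graded question posed to the VLM judge

def score_video(
    frames: List[bytes],                       # sampled key frames of the generated video
    criteria: List[Criterion],
    judge: Callable[[List[bytes], str], int],  # VLM call returning a 1-5 rating
) -> dict:
    """Aggregate per-criterion judge ratings into per-dimension scores."""
    by_dim: dict = {}
    for c in criteria:
        rating = judge(frames, c.question)
        by_dim.setdefault(c.dimension, []).append(rating)
    return {dim: mean(ratings) for dim, ratings in by_dim.items()}

# Stub judge so the sketch runs end to end; a real judge would wrap a VLM API.
def stub_judge(frames, question):
    return 3

criteria = [
    Criterion("perceptual logic", "Does each frame follow physically from the last?"),
    Criterion("abstract planning", "Do the frames realize the multi-step goal in order?"),
]
print(score_video([b"frame0", b"frame1"], criteria, stub_judge))
```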
Related papers
- RISE-Video: Can Video Generators Decode Implicit World Rules? [71.92434352963427]
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories. We propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
arXiv Detail & Related papers (2026-02-05T18:36:10Z)
- PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation [7.0748516420242495]
PRiSM is a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent. We propose five targeted evaluation tasks covering perturbation, symbolic program synthesis, robustness, reasoning correction, and ambiguity resolution.
arXiv Detail & Related papers (2025-12-05T18:14:55Z)
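As a concrete illustration of PRiSM-style Python-grounded checking, the sketch below executes a reference program to obtain the ground truth and verifies a model's free-form numeric answer against it. The example problem, the crude answer parsing, and the tolerance are assumptions for illustration, not PRiSM's actual PrismAgent pipeline.

```python
# Minimal sketch of "Python-grounded" evaluation: a reference program computes
# the ground-truth value; the model's answer is parsed and compared to it.
# The physics problem and tolerance below are illustrative assumptions.
import math

def reference_program() -> float:
    # e.g. projectile range for v = 20 m/s at 45 degrees, g = 9.81 m/s^2
    v, theta, g = 20.0, math.radians(45), 9.81
    return v**2 * math.sin(2 * theta) / g

def check_answer(model_answer: str, rel_tol: float = 1e-2) -> bool:
    """Parse the model's answer and compare it to the executed ground truth."""
    try:
        value = float(model_answer.strip().split()[-1])  # crude numeric extraction
    except ValueError:
        return False
    return math.isclose(value, reference_program(), rel_tol=rel_tol)

print(check_answer("The range is approximately 40.77"))  # True
```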
- Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning [6.800544911407401]
GRiP (Guided Reasoning and Perception) is a novel training framework that cultivates robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. It achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench.
arXiv Detail & Related papers (2025-11-27T07:18:25Z)
- LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z)
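To illustrate how dot-matrix drawings can be scored as directly observable outputs, here is a minimal sketch assuming an ASCII grid format ('#' = ink) and a pixel-IoU metric; both the format and the metric are assumptions, and LTD-Bench's own scoring protocol may differ.

```python
# Illustrative dot-matrix scoring in the spirit of LTD-Bench: the model emits
# an ASCII grid and we compare it to a target shape by pixel IoU.
def parse_grid(text: str) -> set:
    """Return the set of (row, col) cells the drawing fills."""
    return {
        (r, c)
        for r, line in enumerate(text.strip().splitlines())
        for c, ch in enumerate(line)
        if ch == "#"
    }

def iou(pred: str, target: str) -> float:
    p, t = parse_grid(pred), parse_grid(target)
    return len(p & t) / len(p | t) if p | t else 1.0

target = "###\n#.#\n###"           # a 3x3 square outline
model_output = "###\n#.#\n##."     # one corner missing
print(f"IoU = {iou(model_output, target):.2f}")  # IoU = 0.88
```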
- Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners. We focus on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
arXiv Detail & Related papers (2025-10-30T17:59:55Z)
- Explain Before You Answer: A Survey on Compositional Visual Reasoning [74.27548620675748]
Compositional visual reasoning has emerged as a key research frontier in multimodal AI. This survey systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception.
arXiv Detail & Related papers (2025-08-24T11:01:51Z)
- Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models [10.1080193179562]
Current understanding models excel at recognizing "what" but fall short in high-level cognitive tasks like causal reasoning and future prediction. We propose a novel framework that fuses a powerful Vision Foundation Model for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core.
arXiv Detail & Related papers (2025-07-08T09:43:17Z)
- Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o, and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z)
- Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT [24.085953089267772]
We show how OpenAI o3 and GPT-4o fail to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. We introduce MVPBench, a benchmark designed to rigorously evaluate visual physical reasoning through the lens of visual chain-of-thought (CoT). Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains.
arXiv Detail & Related papers (2025-05-30T03:48:59Z)
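A hedged sketch of how multi-path visual CoT can be checked against a graph: admissible reasoning chains form a DAG over intermediate states, and a model's step sequence is accepted only if every consecutive step is an admissible edge and the chain ends at the goal. The states and edges below are invented for illustration and do not reproduce MVPBench's actual graphs or matching procedure.

```python
# Toy graph-based multi-path CoT checker. VALID_EDGES encodes two admissible
# reasoning paths toward the same goal state; any chain that traces only
# admissible edges and reaches the goal is accepted.
VALID_EDGES = {
    ("ball released", "ball falls"),
    ("ball falls", "ball hits ramp"),
    ("ball hits ramp", "ball rolls right"),
    ("ball falls", "ball bounces"),      # an alternative admissible path
    ("ball bounces", "ball rolls right"),
}

def is_valid_chain(steps: list, goal: str = "ball rolls right") -> bool:
    """Accept a chain iff every consecutive pair is an admissible edge
    and the final step matches the goal state."""
    ok = all((a, b) in VALID_EDGES for a, b in zip(steps, steps[1:]))
    return ok and bool(steps) and steps[-1] == goal

print(is_valid_chain(["ball released", "ball falls", "ball bounces", "ball rolls right"]))  # True
print(is_valid_chain(["ball released", "ball rolls right"]))  # False
```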
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
- Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance [41.6755826072905]
In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories. Existing vision-language models often underperform in real-world applications because of sub-optimal prompt engineering. We propose a Concept-guided Human-like Bayesian Reasoning framework to address these issues.
arXiv Detail & Related papers (2025-03-20T06:20:13Z)
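To give a flavor of concept-guided Bayesian reasoning for zero-shot recognition, here is a toy sketch in which class posteriors combine priors with naive-Bayes concept likelihoods scored by a stubbed concept detector; all concept lists and scores are illustrative assumptions and do not reproduce the paper's framework.

```python
# Toy concept-guided Bayesian zero-shot classification:
# P(class | image) is proportional to P(class) * prod_c P(concept_c | image),
# where concept presence scores would come from a VLM in a real system.
from math import prod

CLASS_CONCEPTS = {
    "zebra": ["striped coat", "hooves", "mane"],
    "tiger": ["striped coat", "whiskers", "claws"],
}

def posterior(concept_scores: dict, priors: dict) -> dict:
    """Combine class priors with per-concept likelihoods, then normalize."""
    unnorm = {
        cls: priors[cls] * prod(concept_scores.get(c, 0.5) for c in concepts)
        for cls, concepts in CLASS_CONCEPTS.items()
    }
    z = sum(unnorm.values())
    return {cls: v / z for cls, v in unnorm.items()}

# Stubbed concept detector output; a real detector would query a VLM per concept.
scores = {"striped coat": 0.9, "hooves": 0.8, "mane": 0.7, "whiskers": 0.2, "claws": 0.3}
print(posterior(scores, priors={"zebra": 0.5, "tiger": 0.5}))  # zebra dominates
```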