Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
- URL: http://arxiv.org/abs/2511.15065v1
- Date: Wed, 19 Nov 2025 03:18:29 GMT
- Title: Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
- Authors: Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu
- Abstract summary: Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity. We introduce VR-Bench, a benchmark designed to systematically evaluate video models' reasoning capabilities.
- Score: 42.11140720884257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the progress of video models motivates us to ask: Can video models reason via video generation? Compared with a discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which makes it an ideal substrate for spatial reasoning. In this work, we explore the reasoning-via-video paradigm and introduce VR-Bench, a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, in which diverse sampling during inference improves reasoning reliability by 10-20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
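The abstract describes evaluation on procedurally generated mazes and a test-time scaling effect from diverse sampling, but no evaluation code is given here. The following is a minimal, hypothetical Python sketch of a best-of-N maze-scoring loop in that spirit: a stub stands in for decoding a candidate trajectory from a generated solution video, and a maze counts as solved if any of N sampled trajectories forms a valid path. All names (`sample_candidate_path`, `is_valid_solution`, `solved_best_of_n`) are illustrative and not taken from the VR-Bench paper.

```python
import random


def sample_candidate_path(maze, start, goal, max_steps=64, rng=None):
    """Stand-in for decoding one candidate trajectory from a generated
    solution video: a goal-biased random walk over free cells of the grid."""
    rng = rng or random.Random()
    path, pos = [start], start
    for _ in range(max_steps):
        if pos == goal:
            break
        moves = [
            (pos[0] + dr, pos[1] + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= pos[0] + dr < len(maze)
            and 0 <= pos[1] + dc < len(maze[0])
            and maze[pos[0] + dr][pos[1] + dc] == 0
        ]
        if not moves:
            break
        # Prefer moves that reduce Manhattan distance to the goal.
        moves.sort(key=lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1]))
        pos = moves[0] if rng.random() < 0.7 else rng.choice(moves)
        path.append(pos)
    return path


def is_valid_solution(maze, path, start, goal):
    """Valid iff the path starts at the entrance, ends at the goal, and only
    steps between adjacent free cells."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1 and maze[r2][c2] == 0
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )


def solved_best_of_n(maze, start, goal, n_samples=8, seed=0):
    """Best-of-N scoring: the maze counts as solved if any of the N sampled
    trajectories is valid, which is how diverse sampling at test time can
    raise reliability."""
    rng = random.Random(seed)
    return any(
        is_valid_solution(maze, sample_candidate_path(maze, start, goal, rng=rng), start, goal)
        for _ in range(n_samples)
    )


# Tiny 5x5 maze: 0 = free cell, 1 = wall; entrance top-left, goal bottom-right.
MAZE = [
    [0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print("solved with 1 sample :", solved_best_of_n(MAZE, (0, 0), (4, 4), n_samples=1))
print("solved with 8 samples:", solved_best_of_n(MAZE, (0, 0), (4, 4), n_samples=8))
```

In a real pipeline the random-walk stub would be replaced by tracing the agent's position frame by frame in the model-generated video; the best-of-N aggregation is one plausible reading of the reported 10-20% reliability gain from diverse sampling.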
Related papers
- Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning [18.15310805625469]
We present Know-Show, a new benchmark designed to evaluate multimodal Video-Language Models (Video-LMs). Know-Show unifies reasoning and localization within a single evaluation framework consisting of five scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-language questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding.
arXiv Detail & Related papers (2025-12-05T08:15:49Z)
- RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence [24.51106324851909]
We introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question. Experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric.
arXiv Detail & Related papers (2025-12-02T10:29:51Z)
- V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models [52.97290143922252]
V-ReasonBench is a benchmark designed to assess video reasoning across four key dimensions. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning.
arXiv Detail & Related papers (2025-11-20T18:59:42Z)
- TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models [42.763907973320464]
TiViBench is a hierarchical benchmark designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. We introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models.
arXiv Detail & Related papers (2025-11-17T18:52:44Z)
- Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners. We focus on the leading and popular Veo-3 and evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
arXiv Detail & Related papers (2025-10-30T17:59:55Z)
- Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.15756345836901]
We provide a comprehensive review of multimodal spatial reasoning tasks with large models. We review advances in embodied AI, including vision-language navigation and action models. We consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors.
arXiv Detail & Related papers (2025-10-29T17:55:43Z)
- Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning [39.6349428129868]
Multimodal large language models (MLLMs) are crucial for downstream tasks like video question answering and temporal grounding. We propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning.
arXiv Detail & Related papers (2025-08-06T13:03:21Z)
- VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks [41.90092896728809]
We present VidBridge-R1, the first versatile video reasoning model that effectively bridges the "Reason-Then-Respond" paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model.
arXiv Detail & Related papers (2025-06-10T03:57:53Z)
- VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? [18.9270920369958]
Long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. Recent efforts have proposed benchmarks aimed at video reasoning, but tasks are often knowledge-driven and do not rely heavily on visual content. We introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning.
arXiv Detail & Related papers (2025-05-29T11:33:43Z)
- Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z)
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of video keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.