When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
- URL: http://arxiv.org/abs/2602.08236v1
- Date: Mon, 09 Feb 2026 03:21:48 GMT
- Title: When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
- Authors: Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal
- Abstract summary: We present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency.
- Score: 108.73849507002195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
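The abstract frames imagination as a budgeted, selectively invoked resource: first check whether the current views already suffice, and only otherwise spend world-model calls on imagined viewpoints. The sketch below illustrates one plausible shape of such a control loop; all names (SpatialMLLM, WorldModel, adaptive_spatial_answer, and their methods) are hypothetical stand-ins, not the paper's actual interfaces.

```python
from typing import List, Protocol


class SpatialMLLM(Protocol):
    """Assumed reasoner interface: judges sufficiency, requests views, answers."""

    def is_evidence_sufficient(self, question: str, views: List[bytes]) -> bool: ...
    def propose_viewpoint(self, question: str, views: List[bytes]) -> str: ...
    def answer(self, question: str, views: List[bytes]) -> str: ...


class WorldModel(Protocol):
    """Assumed world-model interface: renders an imagined image from a requested viewpoint."""

    def imagine(self, views: List[bytes], viewpoint: str) -> bytes: ...


def adaptive_spatial_answer(
    mllm: SpatialMLLM,
    world_model: WorldModel,
    question: str,
    observed_views: List[bytes],
    max_imagination_calls: int = 3,
) -> str:
    """Answer a spatial question, imagining new views only when needed.

    The loop checks whether the static visual evidence already suffices;
    only if not does it spend one world-model call to imagine the most
    informative unseen viewpoint, up to a fixed budget.
    """
    evidence = list(observed_views)
    calls_used = 0
    while (
        calls_used < max_imagination_calls
        and not mllm.is_evidence_sufficient(question, evidence)
    ):
        viewpoint = mllm.propose_viewpoint(question, evidence)
        evidence.append(world_model.imagine(evidence, viewpoint))
        calls_used += 1
    return mllm.answer(question, evidence)
```

Setting max_imagination_calls to zero recovers a no-imagination baseline, while a large budget whose sufficiency check always returns False approximates a fixed always-imagine strategy, which is the comparison the abstract reports selective control can match or outperform with fewer world-model calls and language tokens.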
Related papers
- Internal World Models as Imagination Networks in Cognitive Agents [0.0]
We propose that imagination serves to access an internal world model (IWM) and use psychological network analysis to explore IWMs in humans and large language models (LLMs). Our study offers a novel method for comparing internally-generated representations in humans and AI, providing insights for developing human-like imagination in artificial intelligence.
arXiv Detail & Related papers (2025-10-05T23:01:10Z) - Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT [24.085953089267772]
We show how OpenAI o3 and GPT-4o fail to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. We introduce MVPBench, a benchmark designed to rigorously evaluate visual physical reasoning through the lens of visual chain-of-thought (CoT). Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains.
arXiv Detail & Related papers (2025-05-30T03:48:59Z) - Grounded Reinforcement Learning for Visual Reasoning [51.94871616778874]
We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with reinforcement learning. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
arXiv Detail & Related papers (2025-05-29T17:20:26Z) - VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search [89.43196232124883]
VisuoThink is a novel framework that seamlessly integrates visuospatial and linguistic domains. It enables progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search.
arXiv Detail & Related papers (2025-04-12T08:37:30Z) - Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models [58.64449765678416]
We introduce Landscape of Thoughts (LoT), a tool for inspecting the reasoning trajectories produced by a given reasoning method on any multiple-choice dataset. LoT distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories.
arXiv Detail & Related papers (2025-03-28T06:09:51Z) - Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks. These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images. We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models [40.41276154014666]
We present Visual Riddles, a benchmark aimed at testing vision and language models on visual riddles that require commonsense and world knowledge.
The benchmark comprises 400 visual riddles, each featuring a unique image created by a variety of text-to-image models.
Human evaluation reveals that existing models lag significantly behind human performance (82% accuracy), with Gemini-Pro-1.5 leading among models at 40% accuracy.
arXiv Detail & Related papers (2024-07-28T11:56:03Z) - Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models [37.44286562901589]
We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning.
We conduct a comprehensive evaluation of competitive language and vision-language models.
Our findings reveal several counter-intuitive insights that have been overlooked in the literature.
arXiv Detail & Related papers (2024-06-21T03:53:37Z) - Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.