SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery
- URL: http://arxiv.org/abs/2512.07733v1
- Date: Mon, 08 Dec 2025 17:20:50 GMT
- Title: SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery
- Authors: Meng Cao, Xingyu Li, Xue Liu, Ian Reid, Xiaodan Liang
- Abstract summary: SpatialDreamer is a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration. GeoPO introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.
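To make the abstract's description of GeoPO more concrete, here is a minimal sketch of tree-structured sampling with step-level reward estimation. Everything in it is an assumption on our part: the `Node` structure, the group-mean baseline, and the `geometric_consistency` stub (standing in for whatever geometric check the paper actually applies) are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of GeoPO-style tree-structured sampling with
# step-level rewards. All names and reward shapes are assumptions.
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    trace: str                                    # partial reasoning trace so far
    children: list["Node"] = field(default_factory=list)
    reward: float = 0.0                           # step-level reward estimate


def geometric_consistency(trace: str) -> float:
    """Stand-in for a geometric consistency score in [0, 1].

    A real system might reproject views imagined by the world model and
    measure depth or epipolar agreement; here we only return a stub.
    """
    return random.random()


def expand(node: Node, policy, branching: int = 3) -> None:
    """Tree-structured sampling: draw several candidate next steps from
    the policy and attach each as a child with its own step reward."""
    for _ in range(branching):
        step = policy(node.trace)                 # one exploration/reasoning step
        child = Node(trace=node.trace + step)
        child.reward = geometric_consistency(child.trace)
        node.children.append(child)


def step_advantages(node: Node) -> list[float]:
    """Step-level advantage: each sibling's reward relative to the group
    mean, a GRPO-style baseline that needs no learned value function."""
    rewards = [c.reward for c in node.children]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Minimal usage: expand one node and inspect the per-step advantages.
root = Node(trace="Q: where is the chair relative to the table?")
expand(root, policy=lambda t: " <step>")
print(step_advantages(root))
```

Scoring each sampled step against its siblings' mean is one plausible way to obtain fine-grained credit assignment at every tree depth without a learned critic; the actual GeoPO objective may differ.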
Related papers
- REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories
We introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans.
arXiv Detail & Related papers (2025-11-30T05:20:22Z)
- SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to interact with the physical environment. Existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric. We propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels.
arXiv Detail & Related papers (2025-11-26T15:04:18Z)
- Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
We provide a comprehensive review of multimodal spatial reasoning tasks with large models. We review advances in embodied AI, including vision-language navigation and action models. We consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors.
arXiv Detail & Related papers (2025-10-29T17:55:43Z)
- 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
This work introduces a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs. Through experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. These findings highlight both emerging capabilities and persistent limitations in current MLLMs' spatial reasoning.
arXiv Detail & Related papers (2025-08-27T17:22:34Z)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space (see the sketch after this list). Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z)
- EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning. Our method enables zero-shot spatial reasoning without task-specific fine-tuning. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z)
- SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning
We propose a novel approach to bolster the spatial reasoning capabilities of Vision-Language Models (VLMs). Our approach comprises two stages: spatial coordinate bi-directional alignment and chain-of-thought spatial grounding. We evaluate our method on challenging navigation and manipulation tasks, both in simulation and real-world settings.
arXiv Detail & Related papers (2025-01-17T09:46:27Z)
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z)
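The "drawing to reason" idea from the VILASR entry above lends itself to a small illustration. The sketch below is our own guess at what an elementary drawing-operation action space could look like; the operation names and the Pillow-based canvas are assumptions, not VILASR's actual interface.

```python
# Illustrative only: a tiny "draw to reason" action space in the spirit
# of the VILASR summary above. Names and canvas choice are assumptions.
from PIL import Image, ImageDraw


def draw_line(img: Image.Image, p0, p1, color="red") -> Image.Image:
    """Elementary drawing op: annotate the scene with a guide line."""
    out = img.copy()
    ImageDraw.Draw(out).line([p0, p1], fill=color, width=3)
    return out


def draw_box(img: Image.Image, box, color="blue") -> Image.Image:
    """Elementary drawing op: mark a region the model is reasoning about."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=3)
    return out


# A reasoning step becomes: pick an op, apply it, and feed the edited
# image back to the model alongside the text produced so far.
canvas = Image.new("RGB", (320, 240), "white")
canvas = draw_box(canvas, (40, 40, 160, 200))        # mark candidate object
canvas = draw_line(canvas, (100, 120), (280, 120))   # sketch a spatial relation
```

Interleaving such edits with text generation gives the model an external visual scratchpad, which is the general mechanism the VILASR and MVoT entries describe; the papers' actual operation sets may differ.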