Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling
- URL: http://arxiv.org/abs/2512.01821v2
- Date: Mon, 08 Dec 2025 17:14:06 GMT
- Title: Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling
- Authors: Meng Cao, Haokun Lin, Haoyuan Li, Haoran Tang, Rongtao Xu, Dong An, Xue Liu, Ian Reid, Xiaodan Liang
- Abstract summary: This paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like imagination. We show that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks.
- Score: 68.14113731953971
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
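The abstract describes RePE only at a high level. As a minimal sketch of the idea, assuming 4x4 camera-to-world extrinsics and a simple flattened rotation-translation feature (all function names here are hypothetical, not the paper's code):

```python
# Hedged sketch of a relative pose encoding in the spirit of RePE.
# Assumes 4x4 camera-to-world extrinsics; names are illustrative only.
import numpy as np

def relative_pose(T_src: np.ndarray, T_tgt: np.ndarray) -> np.ndarray:
    """Transform mapping the source camera frame to the target camera frame."""
    return np.linalg.inv(T_src) @ T_tgt

def repe_features(T_src: np.ndarray, T_tgt: np.ndarray) -> np.ndarray:
    """Flatten the relative rotation and translation into a 12-dim feature."""
    T_rel = relative_pose(T_src, T_tgt)
    R, t = T_rel[:3, :3], T_rel[:3, 3]
    return np.concatenate([R.reshape(-1), t])

# Toy usage: the encoding depends only on relative motion.
T_a = np.eye(4)
T_b = np.eye(4)
T_b[:3, 3] = [1.0, 0.0, 0.0]          # target camera shifted 1 unit along x
print(repe_features(T_a, T_b)[-3:])   # -> [1. 0. 0.]
```

This toy usage also suggests why a relative scheme can beat absolute coordinates: left-multiplying both poses by the same world transform W cancels in inv(W @ T_a) @ (W @ T_b), so the encoding is invariant to the arbitrary choice of world origin.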
Related papers
- Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding [22.218083641125137]
Existing 3D scene understanding methods mostly distill language features from 2D foundation models into 3D feature fields. We propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry. Our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
arXiv Detail & Related papers (2026-02-17T17:10:13Z)
- OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding [53.33067495235966]
OnlineSI is a framework that can improve its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations. We further integrate 3D point cloud information with semantic information, helping the MLLM better locate and identify objects in the scene.
arXiv Detail & Related papers (2026-01-23T08:17:57Z)
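The summary above only names the mechanism. Purely as an illustration, a finite spatial memory could be a bounded FIFO of per-frame entries; the sketch below follows that assumption and is not OnlineSI's actual code:

```python
# Minimal sketch of a finite spatial memory: a bounded FIFO of per-frame
# entries. All names and the eviction policy are assumptions.
from collections import deque
from dataclasses import dataclass

@dataclass
class Observation:
    frame_id: int
    pose: list       # camera pose for the frame, e.g. a flattened 4x4 matrix
    features: list   # fused point-cloud + semantic features for the frame

class SpatialMemory:
    def __init__(self, capacity: int = 64):
        # Oldest observations are evicted automatically once full.
        self._buffer = deque(maxlen=capacity)

    def insert(self, obs: Observation) -> None:
        self._buffer.append(obs)

    def recall(self, k: int = 8) -> list:
        # Return the k most recent observations for the MLLM to attend over.
        return list(self._buffer)[-k:]
```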
- G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning [36.62798449863548]
Vision-Language Models (VLMs) still lack robustness in spatial intelligence. We present G$^2$VLM, a vision-language model that bridges two fundamental aspects of spatial intelligence.
arXiv Detail & Related papers (2025-11-26T18:59:39Z)
- Imagine in Space: Exploring the Frontier of Spatial Intelligence and Reasoning Efficiency in Vision Language Models [23.12717700882611]
Spatial reasoning is a fundamental component of human cognition. Current large language models (LLMs) and vision language models (VLMs) have demonstrated remarkable reasoning capabilities across logical inference, problem solving, and decision making. We hypothesize that imagination, the internal simulation of spatial states, is the dominant reasoning mechanism within a spatial world model.
arXiv Detail & Related papers (2025-11-16T03:09:55Z)
- UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding [65.60549881706959]
We introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our framework employs an LLM to comprehend and decode sentences and 3D representations. We propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations.
arXiv Detail & Related papers (2025-08-16T07:27:31Z)
- Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [11.222744122842023]
We introduce a plug-and-play module that implicitly incorporates 3D geometry features into Vision-Language-Action (VLA) models. Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z)
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence [13.168559963356952]
We present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Our key insight is to unleash the strong structure prior of the feed-forward visual geometry foundation model. A connector then integrates both feature streams into unified visual tokens for enhanced spatial understanding.
arXiv Detail & Related papers (2025-05-29T17:59:04Z)
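The connector itself is not specified in this summary. Assuming one token stream from a 2D visual encoder and one from a geometry branch, with invented dimensions, a minimal fusion module might look like this (an illustration, not Spatial-MLLM's implementation):

```python
# Hedged sketch of a two-stream connector. Dimensions, the projection
# layers, and the concatenation scheme are assumptions.
import torch
import torch.nn as nn

class Connector(nn.Module):
    def __init__(self, d_visual: int, d_geom: int, d_model: int):
        super().__init__()
        self.proj_visual = nn.Linear(d_visual, d_model)
        self.proj_geom = nn.Linear(d_geom, d_model)

    def forward(self, visual_tokens: torch.Tensor, geom_tokens: torch.Tensor) -> torch.Tensor:
        # Project both streams into a shared embedding space, then
        # concatenate along the sequence axis to form unified visual tokens.
        v = self.proj_visual(visual_tokens)  # (B, Nv, d_model)
        g = self.proj_geom(geom_tokens)      # (B, Ng, d_model)
        return torch.cat([v, g], dim=1)      # (B, Nv + Ng, d_model)
```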
- MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams [65.02628814094639]
Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether Multimodal Large Language Models genuinely understand mathematical diagrams beyond superficial pattern recognition. We introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. We also construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs annotated with geometric primitives and precise spatial relationships.
arXiv Detail & Related papers (2025-03-26T17:30:41Z)
- Re-Thinking Inverse Graphics With Large Language Models [51.333105116400205]
Inverse graphics -- inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics.
We propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM.
We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training.
arXiv Detail & Related papers (2024-04-23T16:59:02Z)
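As a rough illustration of the continuous numeric head mentioned above (regressing scene parameters directly from a hidden state instead of emitting digits as text tokens), here is a sketch under invented layer sizes, not IG-LLM's code:

```python
# Hedged sketch of a continuous numeric head on top of an LLM hidden state.
# Layer sizes and the parameterization are assumptions.
import torch
import torch.nn as nn

class NumericHead(nn.Module):
    """Regress continuous scene parameters (e.g. object pose and scale)
    from the LLM's final hidden state, avoiding text-token quantization."""
    def __init__(self, d_model: int, n_params: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, n_params),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.mlp(hidden)  # (B, n_params) continuous predictions
```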
- Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z)