SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion
- URL: http://arxiv.org/abs/2511.17308v1
- Date: Fri, 21 Nov 2025 15:24:33 GMT
- Title: SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion
- Authors: Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He, Weida Wang,
- Abstract summary: Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks. Most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. We propose a novel vision encoder based on hierarchical fusion of geometry and semantic features, generating spatial-aware visual embeddings.
- Score: 23.86761713752287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantic features, generating spatial-aware visual embeddings and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), which is restricted to instance-level semantic features. This motivates us to complement CLIP with geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness of the proposed SpatialGeo. The network is efficiently trained using the pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions that rely solely on the CLIP encoder. Experimental results show that SpatialGeo improves accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% on SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.
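To make the fusion idea above concrete, here is a minimal PyTorch sketch of a hierarchical adapter that merges CLIP-style semantic tokens with geometry tokens from a vision-only self-supervised encoder, with random dropping of the semantic branch during training. All module names, dimensions, the two-stage MLP structure, and the dropping probability are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of geometry-semantics fusion; hyperparameters are assumptions.
import torch
import torch.nn as nn

class HierarchicalFusionAdapter(nn.Module):
    """Fuses CLIP-style semantic tokens with geometry tokens from a
    vision-only self-supervised encoder into spatial-aware visual tokens."""

    def __init__(self, sem_dim=1024, geo_dim=768, llm_dim=4096, p_drop=0.3):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, llm_dim)   # project CLIP patch features
        self.geo_proj = nn.Linear(geo_dim, llm_dim)   # project geometry features
        self.fuse = nn.Sequential(                    # simple two-stage adapter MLP
            nn.Linear(2 * llm_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.p_drop = p_drop  # probability of dropping the semantic branch

    def forward(self, sem_tokens, geo_tokens):
        sem = self.sem_proj(sem_tokens)
        geo = self.geo_proj(geo_tokens)
        # Random feature dropping (assumed form): occasionally zero out the CLIP
        # branch so the model cannot rely on semantic features alone.
        if self.training and torch.rand(()).item() < self.p_drop:
            sem = torch.zeros_like(sem)
        return self.fuse(torch.cat([sem, geo], dim=-1))

# Usage with dummy token grids (batch of 2, 576 visual tokens each):
adapter = HierarchicalFusionAdapter()
sem = torch.randn(2, 576, 1024)   # stand-in for CLIP patch embeddings
geo = torch.randn(2, 576, 768)    # stand-in for vision-only SSL (geometry) embeddings
tokens_for_llm = adapter(sem, geo)
print(tokens_for_llm.shape)       # torch.Size([2, 576, 4096])
```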
Related papers
- Thinking with Geometry: Active Geometry Integration for Spatial Reasoning [68.59084007360615]
We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.
arXiv Detail & Related papers (2026-02-05T18:59:32Z)
- Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions [18.455501447828343]
Spatial Intelligence (SI) has predominantly relied on Vision-Language Models (VLMs). We introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. We find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential.
arXiv Detail & Related papers (2026-01-07T05:13:52Z)
- S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance [20.55536735670125]
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. We propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning.
arXiv Detail & Related papers (2025-12-01T03:08:34Z)
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards [37.39035418889281]
We introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards.
arXiv Detail & Related papers (2025-11-10T18:52:47Z)
- Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models [75.45940282834327]
We introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy, resulting in significant improvements across multiple tasks.
arXiv Detail & Related papers (2025-11-03T14:27:00Z)
- Spatial Preference Rewarding for MLLMs Spatial Understanding [92.25703021388142]
Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities. We propose a Spatial Preference Rewarding (SPR) approach that enhances MLLMs' spatial capabilities.
arXiv Detail & Related papers (2025-10-16T07:16:18Z)
- See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model [33.18304419115947]
We introduce SEE&TREK, the first training-free prompting framework to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. We focus on increasing visual diversity and motion reconstruction. Our method is training- and GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLMs.
arXiv Detail & Related papers (2025-09-19T15:30:26Z)
- Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture [16.15618237704827]
We present a systematic analysis of spatial understanding from both data and architectural perspectives. From the data perspective, the performance of spatial understanding converges quickly as the training data increases. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model.
arXiv Detail & Related papers (2025-09-02T14:22:43Z)
- FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations [78.65988445433844]
FloorplanQA is a diagnostic benchmark for evaluating spatial reasoning in large language models. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces.
arXiv Detail & Related papers (2025-07-10T11:16:48Z)
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence [13.168559963356952]
We present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. A connector then integrates both features into unified visual tokens for enhanced spatial understanding.
arXiv Detail & Related papers (2025-05-29T17:59:04Z)
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z)
- Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT-4V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z)
- Evaluating Spatial Understanding of Large Language Models [26.436450329727645]
Large language models show remarkable capabilities across a variety of tasks.
Recent studies suggest that LLM representations implicitly capture aspects of the underlying grounded concepts.
We design natural-language navigation tasks and evaluate the ability of LLMs to represent and reason about spatial structures.
arXiv Detail & Related papers (2023-10-23T03:44:40Z)
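As a rough illustration of the kind of text-only spatial probe described in the last entry, the sketch below builds a natural-language description of a tiny grid scene, asks for the relation between two objects, and checks an answer against the geometric ground truth. The scene layout, prompt wording, and the ask_llm() call are hypothetical stand-ins, not the benchmark's actual tasks.

```python
# Minimal sketch of a natural-language spatial probe; layout and prompt are assumptions.
def describe_scene(positions):
    """positions: dict name -> (x, y); x grows to the east, y grows to the north."""
    return ". ".join(
        f"The {name} is at column {x}, row {y}" for name, (x, y) in positions.items()
    ) + "."

def ground_truth_relation(positions, a, b):
    """Cardinal relation of object a relative to object b."""
    (ax, ay), (bx, by) = positions[a], positions[b]
    ns = "north" if ay > by else "south" if ay < by else ""
    ew = "east" if ax > bx else "west" if ax < bx else ""
    return (ns + ew) or "same place"

positions = {"lamp": (0, 0), "chair": (2, 1)}
prompt = (
    describe_scene(positions)
    + " Is the chair north, south, east, or west of the lamp (or a combination)?"
)
expected = ground_truth_relation(positions, "chair", "lamp")
print(prompt)
print("expected:", expected)  # -> northeast
# answer = ask_llm(prompt)    # hypothetical call to the LLM under evaluation
# correct = expected in answer.lower()
```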
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.