Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression
- URL: http://arxiv.org/abs/2511.11239v2
- Date: Tue, 18 Nov 2025 15:36:54 GMT
- Title: Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression
- Authors: Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, Ping Jian,
- Abstract summary: Existing Vision Language Models (VLMs) struggle to comprehend real-world 3D spatial intelligence. GEODE augments the main VLM with two specialized, plug-and-play modules. The synergy of these modules allows our 1.5B-parameter model to function as a high-level semantic dispatcher.
- Score: 12.590536117486257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometry-aware encoders and superficial 2D-only features, and an output-stage misalignment in which discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM), which acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B-parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
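The DRH's "Embedding-as-Value" idea can be illustrated with a minimal sketch: instead of decoding numbers as discrete tokens, the final hidden state of a special control token is routed to a small MLP that regresses continuous values directly. The dimensions, initialization, and two-layer head below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
D_HIDDEN = 64   # VLM hidden size
D_OUT = 6       # 3D bounding box: (x, y, z, w, h, d)

# Lightweight two-layer MLP regression head (randomly initialized here;
# in practice it would be trained jointly with the VLM).
W1 = rng.normal(0, 0.02, (D_HIDDEN, D_HIDDEN))
b1 = np.zeros(D_HIDDEN)
W2 = rng.normal(0, 0.02, (D_HIDDEN, D_OUT))
b2 = np.zeros(D_OUT)

def regress_from_embedding(h_ctrl: np.ndarray) -> np.ndarray:
    """Map the hidden embedding of a control token to continuous values,
    bypassing the discrete token vocabulary entirely."""
    z = np.maximum(0.0, h_ctrl @ W1 + b1)   # ReLU
    return z @ W2 + b2                      # unbounded continuous outputs

# The VLM emits a special control token; its final hidden state is routed
# to the head instead of the language-model softmax.
h_ctrl = rng.normal(size=D_HIDDEN)
bbox = regress_from_embedding(h_ctrl)
```

Because the head outputs real numbers rather than token IDs, precision is not limited by the tokenizer's vocabulary, which is the misalignment the abstract describes.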
Related papers
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning [43.746951848993035]
Spatial intelligence can emerge from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. We introduce Spa3R, a self-supervised framework that learns a unified view-in spatial representation directly from unposed multi-view images. Experiments demonstrate Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods.
arXiv Detail & Related papers (2026-02-24T18:37:34Z)
- Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z)
- D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation [66.7166217399105]
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning. Our model introduces two key innovations: 1) a Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) a Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive, partially-annotated hybrid data.
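A masked autoregressive loss of the kind this summary mentions can be sketched as cross-entropy averaged only over annotated positions, so that fragments with missing labels contribute nothing. The function below is an illustrative reconstruction of the general technique, not the paper's exact SLFS formulation.

```python
import numpy as np

def masked_ar_loss(logits, targets, mask):
    """Token-level cross-entropy averaged only over positions whose
    annotation exists (mask == 1); unannotated positions are excluded,
    so partially-labeled fragments can be mixed freely in one batch."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    mask = np.asarray(mask, dtype=float)
    return (nll * mask).sum() / max(mask.sum(), 1.0)

# Position 1 is unannotated and is excluded from the average.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
loss = masked_ar_loss(logits, targets=[0, 1, 2], mask=[1, 0, 1])
```

The mask makes the supervision signal well-defined even when different data sources annotate different subsets of the output sequence.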
arXiv Detail & Related papers (2025-12-14T09:53:15Z)
- Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation [34.44214123004662]
We propose VLM3D, a framework that uses vision-language models as differentiable semantic and spatial critics. Our core contribution is a dual critic signal derived from the VLM's "Yes" or "No" log-odds. VLM3D establishes a principled and general path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
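A Yes/No log-odds critic signal of the kind described above is simple to compute: the difference between the logits the VLM assigns to "Yes" and "No" as the next token. The prompt wording and logit values below are hypothetical; in a real pipeline the signal would be kept in an autodiff framework so it stays differentiable.

```python
import math

def yes_no_log_odds(vocab_logits: dict) -> float:
    """Scalar critic signal from a VLM's next-token logits: the log-odds
    of answering "Yes" vs "No" to a question such as "Does the render
    match the caption?" (hypothetical prompt)."""
    return vocab_logits["Yes"] - vocab_logits["No"]

def critic_probability(vocab_logits: dict) -> float:
    """Equivalent probability: sigmoid of the log-odds, i.e.
    P(Yes) / (P(Yes) + P(No)) after restricting to the two answers."""
    return 1.0 / (1.0 + math.exp(-yes_no_log_odds(vocab_logits)))

logits = {"Yes": 1.5, "No": -0.5}   # hypothetical logit values
score = critic_probability(logits)
```

Restricting to the two answer tokens makes the signal insensitive to probability mass spread over the rest of the vocabulary, which is one reason log-odds is a natural choice for a binary critic.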
arXiv Detail & Related papers (2025-11-18T09:05:26Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion [89.98812408058336]
We introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin.
arXiv Detail & Related papers (2025-10-21T03:24:12Z) - From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors [54.84863164684646]
Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders. In this work, we introduce FALCON, a novel paradigm that injects rich 3D spatial tokens into the action head.
arXiv Detail & Related papers (2025-10-20T11:26:45Z) - Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding [0.8883733362171032]
We propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness.
arXiv Detail & Related papers (2025-10-19T22:40:18Z) - Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge [45.19482892758984]
Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. We further design the Cross-modal Affordance Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps.
arXiv Detail & Related papers (2025-10-09T15:01:26Z) - Semantic Causality-Aware Vision-Based 3D Occupancy Prediction [63.752869043357585]
Vision-based 3D semantic occupancy prediction is a critical task in 3D vision. Existing methods, however, often rely on modular pipelines. We propose a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline.
arXiv Detail & Related papers (2025-09-10T08:29:22Z) - Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding [58.38294408121273]
We propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties.
arXiv Detail & Related papers (2025-03-20T20:58:48Z) - CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation [7.246959698735599]
CleverDistiller is a self-supervised, cross-modal 2D-to-3D knowledge distillation (KD) framework. It achieves state-of-the-art performance in both semantic segmentation and 3D object detection, improving over prior methods by up to 10% mIoU.
arXiv Detail & Related papers (2025-03-12T22:18:29Z) - Global-Aware Monocular Semantic Scene Completion with State Space Models [25.621011183332094]
Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image. Existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs). We introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space.
arXiv Detail & Related papers (2025-03-09T11:55:40Z) - GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.