SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning
- URL: http://arxiv.org/abs/2603.00409v1
- Date: Sat, 28 Feb 2026 02:05:35 GMT
- Title: SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning
- Authors: Yi Zhang, Youya Xia, Yong Wang, Meng Song, Xin Wu, Wenjun Wan, Bingbing Liu, AiXue Ye, Hongbo Zhang, Feng Wen
- Abstract summary: SSR is a framework designed for Structured Scene Reasoning.
It seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism.
It achieves state-of-the-art performance on multiple spatial intelligence benchmarks.
- Score: 30.87517633729756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and imprecise fine-grained structural modeling. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to the global-scale 3D grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
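The alignment mechanism described in the abstract can be sketched at a high level. This is a minimal illustration, not the paper's implementation: the function names, feature shapes, and the exact interleaving pattern are assumptions; the paper specifies only that 3D geometric features are anchored to pre-aligned 2D semantics via cross-modal addition, and that the resulting tokens are interleaved into one sequence.

```python
# Hypothetical sketch of SSR-style lightweight 2D/3D alignment.
# Feature vectors are plain lists of floats; in practice these would be
# tensors from a 2D vision encoder and a 3D geometry encoder.

def align_features(feat_2d, feat_3d):
    """Anchor 3D geometric features to pre-aligned 2D visual semantics
    by element-wise (cross-modal) addition, patch by patch."""
    assert len(feat_2d) == len(feat_3d), "one 3D feature per 2D patch"
    return [[a + b for a, b in zip(f2, f3)]
            for f2, f3 in zip(feat_2d, feat_3d)]

def interleave_tokens(vis_tokens, geo_tokens):
    """Interleave visual and geometric tokens into a single sequence
    consumable by the language model."""
    out = []
    for v, g in zip(vis_tokens, geo_tokens):
        out.extend([v, g])
    return out
```

Because the addition reuses the LLM's existing 2D feature space rather than learning a new joint embedding, no large-scale alignment pre-training is needed, which is the training-overhead saving the abstract claims.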
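The scene-graph idea — a global layout expressed as a chain of local triplets in relative coordinates, rebuilt incrementally — can likewise be sketched. Again this is an assumed toy encoding (object names paired with 3D positions), not SSR's actual pipeline, which also involves semantic relations and learned generation.

```python
# Hypothetical sketch: encode a global layout as chained local triplets
# (anchor, relative_offset, neighbor), then incrementally reconstruct
# absolute positions from the chain.

def to_local_triplets(objects):
    """objects: list of (name, (x, y, z)) in a shared global frame.
    Each triplet stores only the offset between consecutive objects,
    so every triplet is independent of the global origin."""
    triplets = []
    for (name_a, pos_a), (name_b, pos_b) in zip(objects, objects[1:]):
        rel = tuple(b - a for a, b in zip(pos_a, pos_b))
        triplets.append((name_a, rel, name_b))
    return triplets

def reconstruct(origin, triplets):
    """Incrementally rebuild absolute positions by walking the chain,
    anchoring the first object at `origin`."""
    layout = {triplets[0][0]: origin}
    for a, rel, b in triplets:
        layout[b] = tuple(p + r for p, r in zip(layout[a], rel))
    return layout
```

Relative triplets keep each step small and local, which is plausibly what makes the scaffold "language-model-friendly": the model emits one bounded-magnitude offset at a time instead of a full set of global coordinates.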
Related papers
- Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence.
To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context.
Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z) - Enhancing Floor Plan Recognition: A Hybrid Mix-Transformer and U-Net Approach for Precise Wall Segmentation [0.0]
This study introduces MitUNet, a hybrid neural network combining a Mix-Transformer encoder and a U-Net decoder.
Our approach achieves a balance between precision and recall, ensuring accurate boundary recovery.
Experiments on the CubiCasa5k dataset and a proprietary regional dataset demonstrate MitUNet's superiority in generating structurally correct masks.
arXiv Detail & Related papers (2025-12-02T04:47:53Z) - LATTICE: Democratize High-Fidelity 3D Generation at Scale [27.310104395842075]
LATTICE is a new framework for high-fidelity 3D asset generation.
VoxSet is a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid.
Our method is simple at its core, but supports arbitrary resolution decoding, low-cost training, and flexible inference schemes.
arXiv Detail & Related papers (2025-11-24T03:22:19Z) - SegSplat: Feed-forward Gaussian Splatting and Open-Set Semantic Segmentation [114.57192386025373]
SegSplat is a novel framework designed to bridge the gap between rapid, feed-forward 3D reconstruction and rich, open-vocabulary semantic understanding.
This work represents a significant step towards practical, on-the-fly generation of semantically aware 3D environments.
arXiv Detail & Related papers (2025-11-23T10:26:38Z) - CUS-GS: A Compact Unified Structured Gaussian Splatting Framework for Multimodal Scene Representation [16.85102888388904]
CUS-GS is a compact unified structured Gaussian Splatting representation.
We propose a feature-aware significance evaluation strategy to guide anchor growing and pruning.
CUS-GS achieves competitive performance compared to state-of-the-art methods using as few as 6M parameters.
arXiv Detail & Related papers (2025-11-22T03:42:49Z) - Video Spatial Reasoning with Object-Centric 3D Rollout [58.12446467377404]
We propose Object-Centric 3D Rollout (OCR) to enable robust video spatial reasoning.
OCR introduces structured perturbations to the 3D geometry of selected objects during training.
OCR compels the model to reason holistically across the entire scene.
arXiv Detail & Related papers (2025-11-17T09:53:41Z) - IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction [82.53307702809606]
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions.
We propose Instance-Grounded Geometry Transformer (IGGT) to unify the knowledge for both spatial reconstruction and instance-level contextual understanding.
arXiv Detail & Related papers (2025-10-26T14:57:44Z) - SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning.
We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks.
Our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z) - Spatial Reasoner: A 3D Inference Pipeline for XR Applications [0.0]
We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks.
Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates.
The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation.
arXiv Detail & Related papers (2025-04-25T14:27:27Z) - Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.