SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding
- URL: http://arxiv.org/abs/2512.03284v1
- Date: Tue, 02 Dec 2025 22:49:01 GMT
- Title: SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding
- Authors: Hongpei Zheng, Shijie Li, Yanran Li, Hujun Yin
- Abstract summary: We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset for house-scale scene understanding. We also propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes.
- Score: 13.974575930417709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.
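To make the exploration loop concrete, here is a minimal Python sketch of the coarse-to-fine, tool-invoking behavior and the adaptive exploration reward the abstract describes. The tool names, the `vlm.decide` interface, and the reward weights `alpha`/`beta` are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of coarse-to-fine active perception, assuming a duck-typed
# `vlm` (decides the next tool call) and `scene` (executes spatial tools).
# Tool names and reward weights are illustrative, not from the paper.

def answer_query(vlm, scene, query, max_steps=6):
    """Invoke spatial tools, coarse to fine, until the model answers."""
    views = [scene.call("floorplan_overview", None)]   # coarse entry view
    calls = []
    for _ in range(max_steps):
        tool, arg = vlm.decide(query, views)           # e.g. ("zoom_room", "kitchen")
        if tool == "answer":
            return arg, calls                          # final textual answer
        calls.append((tool, arg))
        views.append(scene.call(tool, arg))            # finer-grained observation
    return vlm.decide(query, views, force_answer=True)[1], calls

def exploration_reward(calls, correct, alpha=1.0, beta=0.1):
    """Adaptive exploration reward: pay for a correct answer, charge for
    redundant operations (repeated identical tool calls)."""
    redundant = len(calls) - len(set(calls))
    return alpha * float(correct) - beta * redundant
```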
Related papers
- Sparse Multiview Open-Vocabulary 3D Detection [27.57172918603858]
3D object detection has traditionally been solved by training to detect a fixed set of categories. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion.
arXiv Detail & Related papers (2025-09-19T12:22:24Z)
- Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z)
- SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding [44.82926606018167]
3D Visual Grounding aims to localize target objects within a 3D scene based on natural language queries. In this work, we introduce SPAZER - a VLM-driven agent that combines both modalities in a progressive reasoning framework. Experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods.
arXiv Detail & Related papers (2025-06-27T05:34:57Z)
- SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting [104.83629308412958]
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. We propose the first large-scale benchmark that systematically assesses three groups of methods directly in 3D space. Results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation.
arXiv Detail & Related papers (2025-06-10T11:52:45Z)
- From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes [30.015378490907988]
Anywhere3D-Bench is a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models.
arXiv Detail & Related papers (2025-06-05T11:28:02Z)
- H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision [41.529084775662355]
We present a novel 3D occupancy prediction approach, H3O, which features highly efficient architecture designs that incur a significantly lower computational cost than current state-of-the-art methods. In particular, we integrate multi-camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by corresponding 2D labels.
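The heterogeneous 2D supervision hinges on standard differentiable volume rendering; the following generic PyTorch sketch (not H3O's code) shows how per-sample predictions along a ray can be alpha-composited into a 2D value that a depth, semantic, or normal label then supervises.

```python
import torch

def composite_along_ray(sigma, values, deltas):
    """Generic NeRF-style compositing: sigma (N,) densities, values (N, C)
    per-sample predictions (e.g. semantics or normals), deltas (N,) step
    sizes. Returns the rendered (C,) quantity for one ray."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), dim=0
    )                                                           # accumulated transmittance
    weights = alpha * trans                                     # compositing weights
    return (weights.unsqueeze(-1) * values).sum(dim=0)          # differentiable render
```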
arXiv Detail & Related papers (2025-03-06T03:27:14Z)
- SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation [50.420711084672966]
We present SliceOcc, an RGB camera-based model specifically tailored for indoor 3D semantic occupancy prediction. Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves a mIoU of 15.45% across 81 indoor categories.
arXiv Detail & Related papers (2025-01-28T03:41:24Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
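As a toy illustration of the voxelized representation this summary describes, the NumPy snippet below (cell size and grid dimensions are made-up values, not VER's configuration) marks which structured 3D cells contain points.

```python
import numpy as np

def voxelize(points, origin, cell=0.5, dims=(60, 60, 20)):
    """Map (N, 3) world points into a boolean occupancy grid of `dims`
    cells with edge length `cell`, anchored at `origin`."""
    idx = np.floor((points - origin) / cell).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(dims)), axis=1)
    occ = np.zeros(dims, dtype=bool)
    occ[tuple(idx[inside].T)] = True        # mark occupied cells
    return occ
```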
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)