SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding
- URL: http://arxiv.org/abs/2512.03284v1
- Date: Tue, 02 Dec 2025 22:49:01 GMT
- Title: SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding
- Authors: Hongpei Zheng, Shijie Li, Yanran Li, Hujun Yin
- Abstract summary: We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset for house-scale scene understanding. We also propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes.
- Score: 13.974575930417709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.
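To make the exploration loop concrete, here is a minimal Python sketch of the coarse-to-fine, tool-invoking behavior and the adaptive exploration reward the abstract describes. The tool names, the `vlm.decide` interface, and the reward weights `alpha`/`beta` are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of coarse-to-fine active perception, assuming a duck-typed
# `vlm` (decides the next tool call) and `scene` (executes spatial tools).
# Tool names and reward weights are illustrative, not from the paper.

def answer_query(vlm, scene, query, max_steps=6):
    """Invoke spatial tools, coarse to fine, until the model answers."""
    views = [scene.call("floorplan_overview", None)]   # coarse entry view
    calls = []
    for _ in range(max_steps):
        tool, arg = vlm.decide(query, views)           # e.g. ("zoom_room", "kitchen")
        if tool == "answer":
            return arg, calls                          # final textual answer
        calls.append((tool, arg))
        views.append(scene.call(tool, arg))            # finer-grained observation
    return vlm.decide(query, views, force_answer=True)[1], calls

def exploration_reward(calls, correct, alpha=1.0, beta=0.1):
    """Adaptive exploration reward: pay for a correct answer, charge for
    redundant operations (repeated identical tool calls)."""
    redundant = len(calls) - len(set(calls))
    return alpha * float(correct) - beta * redundant
```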
Related papers
- Sparse Multiview Open-Vocabulary 3D Detection [27.57172918603858]
3D object detection has traditionally been solved by training to detect a fixed set of categories. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion.
arXiv Detail & Related papers (2025-09-19T12:22:24Z)
- Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [54.04601077224252]
Embodied scene understanding requires not only comprehending visual-spatial information but also determining where to explore next in the 3D physical world. 3D vision-language learning enables embodied agents to effectively explore and understand their environment. The model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images.
arXiv Detail & Related papers (2025-07-05T14:15:52Z)
- SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding [44.82926606018167]
3D Visual Grounding aims to localize target objects within a 3D scene based on natural language queries. In this work, we introduce SPAZER - a VLM-driven agent that combines both modalities in a progressive reasoning framework. Experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods.
arXiv Detail & Related papers (2025-06-27T05:34:57Z)
- SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting [104.83629308412958]
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. We propose the first large-scale benchmark that systematically assesses three groups of methods directly in 3D space. Results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation.
arXiv Detail & Related papers (2025-06-10T11:52:45Z)
- From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes [30.015378490907988]
Anywhere3D-Bench is a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models.
arXiv Detail & Related papers (2025-06-05T11:28:02Z)
- H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision [41.529084775662355]
We present a novel 3D occupancy prediction approach, H3O, which features highly efficient architecture designs that incur a significantly lower computational cost than current state-of-the-art methods. In particular, we integrate multi-camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by corresponding 2D labels.
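The heterogeneous 2D supervision hinges on standard differentiable volume rendering; the following generic PyTorch sketch (not H3O's code) shows how per-sample predictions along a ray can be alpha-composited into a 2D value that a depth, semantic, or normal label then supervises.

```python
import torch

def composite_along_ray(sigma, values, deltas):
    """Generic NeRF-style compositing: sigma (N,) densities, values (N, C)
    per-sample predictions (e.g. semantics or normals), deltas (N,) step
    sizes. Returns the rendered (C,) quantity for one ray."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), dim=0
    )                                                           # accumulated transmittance
    weights = alpha * trans                                     # compositing weights
    return (weights.unsqueeze(-1) * values).sum(dim=0)          # differentiable render
```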
arXiv Detail & Related papers (2025-03-06T03:27:14Z)
- SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation [50.420711084672966]
We present SliceOcc, an RGB camera-based model specifically tailored for indoor 3D semantic occupancy prediction. Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves a mIoU of 15.45% across 81 indoor categories.
arXiv Detail & Related papers (2025-01-28T03:41:24Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
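As a toy illustration of the voxelized representation this summary describes, the NumPy snippet below (cell size and grid dimensions are made-up values, not VER's configuration) marks which structured 3D cells contain points.

```python
import numpy as np

def voxelize(points, origin, cell=0.5, dims=(60, 60, 20)):
    """Map (N, 3) world points into a boolean occupancy grid of `dims`
    cells with edge length `cell`, anchored at `origin`."""
    idx = np.floor((points - origin) / cell).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(dims)), axis=1)
    occ = np.zeros(dims, dtype=bool)
    occ[tuple(idx[inside].T)] = True        # mark occupied cells
    return occ
```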
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)