Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views
- URL: http://arxiv.org/abs/2511.07813v1
- Date: Wed, 12 Nov 2025 01:20:57 GMT
- Title: Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views
- Authors: Haida Feng, Hao Wei, Zewen Xu, Haolin Wang, Chade Li, Yihong Wu,
- Abstract summary: We propose Sparse3DPR, a training-free framework for open-ended scene understanding. We introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors. We show that Sparse3DPR achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench.
- Score: 7.846553013153199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address these problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
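The task-adaptive subgraph extraction described in the abstract lends itself to a compact illustration: score each scene-graph node against the query embedding and keep only the induced subgraph of relevant nodes. The sketch below is a minimal interpretation under assumed data structures (`Node`, `SceneGraph`) and a cosine-similarity threshold rule; it is not Sparse3DPR's actual implementation.

```python
# Minimal sketch of task-adaptive subgraph extraction: drop scene-graph nodes
# irrelevant to the query, then keep only edges whose endpoints survive.
# The embedding field and the threshold tau are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    name: str                    # open-vocabulary label, e.g. "sofa"
    embedding: list[float]       # semantic feature (e.g. from a VLM)

@dataclass
class SceneGraph:
    nodes: dict[int, Node]
    edges: set[tuple[int, int]]  # relations between node ids

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb + 1e-8)

def extract_subgraph(g: SceneGraph, query_emb: list[float],
                     tau: float = 0.3) -> SceneGraph:
    """Return the induced subgraph of nodes whose similarity to the query
    embedding is at least tau, filtering out query-irrelevant context."""
    keep = {i for i, n in g.nodes.items()
            if cosine(n.embedding, query_emb) >= tau}
    return SceneGraph(
        nodes={i: g.nodes[i] for i in keep},
        edges={(u, v) for (u, v) in g.edges if u in keep and v in keep},
    )
```

The filtered subgraph, rather than the full scene graph, would then be serialized into the LLM prompt, which is what reduces contextual noise and reasoning cost.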
Related papers
- SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding [44.82926606018167]
3D Visual Grounding aims to localize target objects within a 3D scene based on natural language queries. In this work, we introduce SPAZER - a VLM-driven agent that combines both modalities in a progressive reasoning framework. Experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods.
arXiv Detail & Related papers (2025-06-27T05:34:57Z) - SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting [104.83629308412958]
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. We propose the first large-scale benchmark that systematically assesses three groups of methods directly in 3D space. Results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation.
arXiv Detail & Related papers (2025-06-10T11:52:45Z) - OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting [52.40697058096931]
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction. We introduce an approach for open-vocabulary 3D instance segmentation without requiring manual labeling, termed OpenSplat3D. We show results on LERF-mask and LERF-OVS as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2025-06-09T12:37:15Z) - Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs [16.153129392697885]
We introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities. Our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster.
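A rough sketch of the grouping idea this abstract describes: merge Gaussian primitives that are both spatially close and semantically similar into superpoints, e.g. with union-find. The distance and similarity thresholds and the brute-force pairwise loop below are illustrative assumptions; the paper's actual graph construction is more involved.

```python
# Illustrative superpoint grouping: union primitives that are spatially close
# AND semantically similar. Thresholds d_max/s_min are assumed values; a real
# implementation would use a k-NN graph instead of the O(N^2) loop.
import numpy as np

class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))
    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

def build_superpoints(centers: np.ndarray, feats: np.ndarray,
                      d_max: float = 0.1, s_min: float = 0.9) -> list[int]:
    """centers: (N, 3) Gaussian means; feats: (N, D) unit-norm features.
    Returns a superpoint id for each primitive."""
    n = len(centers)
    uf = UnionFind(n)
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(centers[i] - centers[j]) < d_max
            similar = float(feats[i] @ feats[j]) > s_min
            if close and similar:
                uf.union(i, j)
    return [uf.find(i) for i in range(n)]
```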
arXiv Detail & Related papers (2025-04-17T17:56:07Z) - ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning [68.4209681278336]
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions. Current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals. We propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping.
arXiv Detail & Related papers (2025-03-30T03:40:35Z) - Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying [25.32838673665989]
Open-vocabulary 3D scene understanding is crucial for robotics applications, such as natural language-driven manipulation. Existing methods for querying 3D Gaussian Splatting often struggle with inconsistent 2D mask supervision. We present a novel point-level querying framework that performs tracking on segmentation masks to establish a semantically consistent ground-truth.
arXiv Detail & Related papers (2025-03-27T17:59:05Z) - Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning [28.80962812015936]
Imitation Learning can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle when observations fall outside the training distribution. We propose Adapt3R, a general-purpose 3D observation encoder which synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. We show across 93 simulated and 6 real tasks that, when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
arXiv Detail & Related papers (2025-03-06T18:17:09Z) - Language-to-Space Programming for Training-Free 3D Visual Grounding [38.39850802321939]
Language-to-Space Programming (LaSP) is a training-free method for 3D visual grounding. LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods.
arXiv Detail & Related papers (2025-02-03T14:32:36Z) - SLGaussian: Fast Language Gaussian Splatting in Sparse Views [15.0280871846496]
We propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints. SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions.
arXiv Detail & Related papers (2024-12-11T12:18:30Z) - SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images [125.66499135980344]
We propose SparseGrasp, a novel open-vocabulary robotic grasping system. SparseGrasp operates efficiently with sparse-view RGB images and handles scene updates quickly. We show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability.
arXiv Detail & Related papers (2024-12-03T03:56:01Z) - GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane [53.388937705785025]
3D open-vocabulary scene understanding is crucial for advancing augmented reality and robotic applications.
We introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS).
Our method treats the feature selection process as a hyperplane division within the feature space, retaining only features that are highly relevant to the query.
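The hyperplane view of feature selection can be made concrete: a query induces a hyperplane $w \cdot f + b = 0$ in the semantic feature space, and a 3D Gaussian's feature $f$ is retained only if it lies on the positive side. The snippet below fixes the hyperplane normal to the normalized query text embedding for illustration; this is an assumed simplification, since GOI optimizes the hyperplane rather than fixing it.

```python
# Illustrative sketch only: select 3D Gaussians by which side of a semantic
# hyperplane their features fall on. Taking w as the normalized query text
# embedding is an assumption; GOI instead *optimizes* the hyperplane.
import numpy as np

def select_gaussians(features: np.ndarray, text_emb: np.ndarray,
                     bias: float = 0.0) -> np.ndarray:
    """features: (N, D) per-Gaussian semantic features; text_emb: (D,) query.
    Returns a boolean mask of Gaussians with w . f + bias >= 0."""
    w = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return features @ w + bias >= 0.0
```

Framing selection as a single linear test keeps the query-time cost at one matrix-vector product over all Gaussians, which is why it scales to full scenes.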
arXiv Detail & Related papers (2024-05-27T18:57:18Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for 3D scene understanding when labeled scenes are limited. To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)