Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views
- URL: http://arxiv.org/abs/2511.07813v1
- Date: Wed, 12 Nov 2025 01:20:57 GMT
- Title: Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views
- Authors: Haida Feng, Hao Wei, Zewen Xu, Haolin Wang, Chade Li, Yihong Wu,
- Abstract summary: We propose Sparse3DPR, a training-free framework for open-ended scene understanding. We introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors. We show that Sparse3DPR achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench.
- Score: 7.846553013153199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address these problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
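The task-adaptive subgraph extraction described in the abstract lends itself to a compact illustration: score each scene-graph node against the query embedding and keep only the induced subgraph of relevant nodes. The sketch below is a minimal interpretation under assumed data structures (`Node`, `SceneGraph`) and a cosine-similarity threshold rule; it is not Sparse3DPR's actual implementation.

```python
# Minimal sketch of task-adaptive subgraph extraction: drop scene-graph nodes
# irrelevant to the query, then keep only edges whose endpoints survive.
# The embedding field and the threshold tau are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    name: str                    # open-vocabulary label, e.g. "sofa"
    embedding: list[float]       # semantic feature (e.g. from a VLM)

@dataclass
class SceneGraph:
    nodes: dict[int, Node]
    edges: set[tuple[int, int]]  # relations between node ids

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb + 1e-8)

def extract_subgraph(g: SceneGraph, query_emb: list[float],
                     tau: float = 0.3) -> SceneGraph:
    """Return the induced subgraph of nodes whose similarity to the query
    embedding is at least tau, filtering out query-irrelevant context."""
    keep = {i for i, n in g.nodes.items()
            if cosine(n.embedding, query_emb) >= tau}
    return SceneGraph(
        nodes={i: g.nodes[i] for i in keep},
        edges={(u, v) for (u, v) in g.edges if u in keep and v in keep},
    )
```

The filtered subgraph, rather than the full scene graph, would then be serialized into the LLM prompt, which is what reduces contextual noise and reasoning cost.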
Related papers
- SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding [44.82926606018167]
3D Visual Grounding aims to localize target objects within a 3D scene based on natural language queries. In this work, we introduce SPAZER - a VLM-driven agent that combines both modalities in a progressive reasoning framework. Experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods.
arXiv Detail & Related papers (2025-06-27T05:34:57Z) - SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting [104.83629308412958]
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. We propose the first large-scale benchmark that systematically assesses three groups of methods directly in 3D space. Results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation.
arXiv Detail & Related papers (2025-06-10T11:52:45Z) - OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting [52.40697058096931]
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction. We introduce an approach for open-vocabulary 3D instance segmentation without requiring manual labeling, termed OpenSplat3D. We show results on LERF-mask and LERF-OVS as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2025-06-09T12:37:15Z) - Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs [16.153129392697885]
We introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities. Our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster.
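A rough sketch of the grouping idea this abstract describes: merge Gaussian primitives that are both spatially close and semantically similar into superpoints, e.g. with union-find. The distance and similarity thresholds and the brute-force pairwise loop below are illustrative assumptions; the paper's actual graph construction is more involved.

```python
# Illustrative superpoint grouping: union primitives that are spatially close
# AND semantically similar. Thresholds d_max/s_min are assumed values; a real
# implementation would use a k-NN graph instead of the O(N^2) loop.
import numpy as np

class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))
    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

def build_superpoints(centers: np.ndarray, feats: np.ndarray,
                      d_max: float = 0.1, s_min: float = 0.9) -> list[int]:
    """centers: (N, 3) Gaussian means; feats: (N, D) unit-norm features.
    Returns a superpoint id for each primitive."""
    n = len(centers)
    uf = UnionFind(n)
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(centers[i] - centers[j]) < d_max
            similar = float(feats[i] @ feats[j]) > s_min
            if close and similar:
                uf.union(i, j)
    return [uf.find(i) for i in range(n)]
```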
arXiv Detail & Related papers (2025-04-17T17:56:07Z) - ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning [68.4209681278336]
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions. Current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals. We propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping.
arXiv Detail & Related papers (2025-03-30T03:40:35Z) - Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying [25.32838673665989]
Open-vocabulary 3D scene understanding is crucial for robotics applications, such as natural language-driven manipulation. Existing methods for querying 3D Gaussian Splatting often struggle with inconsistent 2D mask supervision. We present a novel point-level querying framework that performs tracking on segmentation masks to establish a semantically consistent ground-truth.
arXiv Detail & Related papers (2025-03-27T17:59:05Z) - Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning [28.80962812015936]
Imitation Learning can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle when observations fall outside the training distribution. We propose Adapt3R, a general-purpose 3D observation encoder which synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. We show across 93 simulated and 6 real tasks that, when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
arXiv Detail & Related papers (2025-03-06T18:17:09Z) - Language-to-Space Programming for Training-Free 3D Visual Grounding [38.39850802321939]
Language-to-Space Programming (LaSP) is a training-free method for 3D visual grounding. LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods.
arXiv Detail & Related papers (2025-02-03T14:32:36Z) - SLGaussian: Fast Language Gaussian Splatting in Sparse Views [15.0280871846496]
We propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints. SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions.
arXiv Detail & Related papers (2024-12-11T12:18:30Z) - SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images [125.66499135980344]
We propose SparseGrasp, a novel open-vocabulary robotic grasping system. SparseGrasp operates efficiently with sparse-view RGB images and handles scene updates quickly. We show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability.
arXiv Detail & Related papers (2024-12-03T03:56:01Z) - GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane [53.388937705785025]
3D open-vocabulary scene understanding is crucial for advancing augmented reality and robotic applications.
We introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS).
Our method treats the feature selection process as a hyperplane division within the feature space, retaining only features that are highly relevant to the query.
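The hyperplane view of feature selection can be made concrete: a query induces a hyperplane $w \cdot f + b = 0$ in the semantic feature space, and a 3D Gaussian's feature $f$ is retained only if it lies on the positive side. The snippet below fixes the hyperplane normal to the normalized query text embedding for illustration; this is an assumed simplification, since GOI optimizes the hyperplane rather than fixing it.

```python
# Illustrative sketch only: select 3D Gaussians by which side of a semantic
# hyperplane their features fall on. Taking w as the normalized query text
# embedding is an assumption; GOI instead *optimizes* the hyperplane.
import numpy as np

def select_gaussians(features: np.ndarray, text_emb: np.ndarray,
                     bias: float = 0.0) -> np.ndarray:
    """features: (N, D) per-Gaussian semantic features; text_emb: (D,) query.
    Returns a boolean mask of Gaussians with w . f + bias >= 0."""
    w = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return features @ w + bias >= 0.0
```

Framing selection as a single linear test keeps the query-time cost at one matrix-vector product over all Gaussians, which is why it scales to full scenes.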
arXiv Detail & Related papers (2024-05-27T18:57:18Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for 3D scene understanding when labeled scenes are limited. To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)