Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings
- URL: http://arxiv.org/abs/2509.12938v1
- Date: Tue, 16 Sep 2025 10:39:37 GMT
- Title: Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings
- Authors: Abdalla Arafa, Didier Stricker
- Abstract summary: We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely. Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation. This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction.
- Score: 17.855913571198013
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Novel view synthesis has seen significant advancements with 3D Gaussian Splatting (3DGS), enabling real-time photorealistic rendering. However, the inherent fuzziness of Gaussian Splatting presents challenges for 3D scene understanding, restricting its broader applications in AR/VR and robotics. While recent works attempt to learn semantics via 2D foundation model distillation, they inherit fundamental limitations: alpha blending averages semantics across objects, making 3D-level understanding impossible. We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely. Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation, creating comprehensive "bags of embeddings" that holistically describe objects. This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction. Experiments demonstrate that our method effectively overcomes the challenges of 3D open-vocabulary object extraction while remaining comparable to state-of-the-art performance in 2D open-vocabulary segmentation, ensuring minimal compromise.
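The retrieval idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy 4-dimensional vectors standing in for CLIP features, and the choice of scoring an object by its best-matching view (max cosine similarity over the bag) are all assumptions made for illustration. The abstract states only that each object carries a bag of multiview embeddings, that text queries are compared against object-level embeddings, and that object IDs are then propagated to pixels or Gaussians.

```python
import numpy as np


def normalize(x):
    """L2-normalize along the last axis so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def score_objects(text_emb, bags):
    """Score each object against a text embedding.

    bags maps object_id -> (n_views, d) array of per-view embeddings.
    ASSUMPTION: an object's score is the max similarity over its bag;
    the source does not specify the aggregation rule.
    """
    text_emb = normalize(text_emb)
    return {oid: float(np.max(normalize(bag) @ text_emb)) for oid, bag in bags.items()}


def retrieve(text_emb, bags):
    """Return the ID of the best-scoring object."""
    scores = score_objects(text_emb, bags)
    return max(scores, key=scores.get)


def extract_gaussians(gaussian_object_ids, object_id):
    """3D extraction: indices of Gaussians assigned to the retrieved object."""
    return np.flatnonzero(gaussian_object_ids == object_id)


def segment_pixels(pixel_object_ids, object_id):
    """2D segmentation: boolean mask from a per-pixel object-ID map."""
    return pixel_object_ids == object_id


# Toy data (hypothetical): two objects, two views each, 4-D stand-in embeddings.
bags = {
    0: np.array([[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]]),
    1: np.array([[0.0, 1.0, 0.0, 0.0], [0.0, 0.9, 0.1, 0.0]]),
}
query = np.array([0.1, 1.0, 0.0, 0.0])  # stand-in for an encoded text prompt

gaussian_object_ids = np.array([0, 0, 1, 1, 1, 0])    # one object ID per Gaussian
pixel_object_ids = np.array([[0, 1, 1], [0, 0, 1]])   # one object ID per pixel

best = retrieve(query, bags)
print("retrieved object:", best)
print("Gaussian indices:", extract_gaussians(gaussian_object_ids, best))
print("pixel mask:\n", segment_pixels(pixel_object_ids, best))
```

Because the comparison happens once per object rather than per Gaussian or per pixel, the same retrieved ID serves both the 2D and 3D tasks. In a real pipeline the toy vectors would be replaced by features from a CLIP image and text encoder, and the ID maps would come from the predecomposed scene.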
Related papers
- FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views [52.02871618456553]
FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussians from any views.
We propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images.
arXiv Detail & Related papers (2025-12-19T13:04:13Z)
- OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion [89.98812408058336]
We introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion.
OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin.
arXiv Detail & Related papers (2025-10-21T03:24:12Z)
- GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting [74.56128224977279]
We present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS).
GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning.
It supports seamless 2D and 3D open-vocabulary queries and reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning.
arXiv Detail & Related papers (2025-08-19T21:26:49Z)
- Tackling View-Dependent Semantics in 3D Language Gaussian Splatting [80.88015191411714]
LaGa establishes cross-view semantic connections by decomposing the 3D scene into objects.
It constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics.
Under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset.
arXiv Detail & Related papers (2025-05-30T16:06:32Z)
- Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs [16.153129392697885]
We introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives.
The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities.
Our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster.
arXiv Detail & Related papers (2025-04-17T17:56:07Z)
- OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding [20.578106363482018]
OpenGS-SLAM is an innovative framework that utilizes 3D Gaussian representation to perform dense semantic SLAM in open-set environments.
Our system integrates explicit semantic labels derived from 2D models into the 3D Gaussian framework, facilitating robust 3D object-level understanding.
Our method achieves 10 times faster semantic rendering and 2 times lower storage costs compared to existing methods.
arXiv Detail & Related papers (2025-03-03T15:23:21Z)
- Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels.
FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z)
- OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding [54.981605111365056]
This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding.
Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing.
arXiv Detail & Related papers (2024-06-04T07:42:33Z)
- Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting [27.974762304763694]
We introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting.
Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features into a novel semantic component of 3D Gaussians.
We build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference.
arXiv Detail & Related papers (2024-03-22T21:28:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.