GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting
- URL: http://arxiv.org/abs/2508.14278v2
- Date: Thu, 21 Aug 2025 09:47:52 GMT
- Title: GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting
- Authors: Elena Alegret, Kunyi Li, Sen Wang, Siyun Liang, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari
- Abstract summary: We present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. It supports seamless 2D and 3D open-vocabulary queries and reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning.
- Score: 74.56128224977279
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D tasks.
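The listing contains no code, but the architecture described above lends itself to a short illustration. The sketch below shows the general pattern of cross-attention over two learnable codebooks: each Gaussian keeps only a low-dimensional instance feature, and high-dimensional language features are synthesized on the fly, which is where the memory saving comes from. All names, dimensions, and the key/value roles assigned to the two codebooks are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the idea described in the
# abstract: cross-attention from low-dimensional per-Gaussian instance
# features over two learnable codebooks yields high-dimensional
# language-aligned features on the fly, so they are never stored
# per Gaussian. Dimensions and the key/value split are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookCrossAttention(nn.Module):
    def __init__(self, inst_dim=16, lang_dim=512, codebook_size=128):
        super().__init__()
        # Two learnable codebooks of view-independent semantic embeddings:
        # one serves as attention keys, the other as the value table that
        # holds the high-dimensional language-aligned embeddings.
        self.keys = nn.Parameter(torch.randn(codebook_size, inst_dim))
        self.values = nn.Parameter(torch.randn(codebook_size, lang_dim))

    def forward(self, inst_feats):
        # inst_feats: (N, inst_dim) low-dim per-Gaussian instance features
        # (in GALA, distilled contrastively) act as the queries.
        logits = inst_feats @ self.keys.t() / inst_feats.shape[-1] ** 0.5
        attn = F.softmax(logits, dim=-1)           # (N, codebook_size)
        return attn @ self.values                  # (N, lang_dim), on the fly

def open_vocab_query(lang_feats, text_emb):
    # Cosine similarity against a text embedding (e.g. from a CLIP-style
    # encoder) scores every Gaussian for a 3D open-vocabulary query.
    return F.normalize(lang_feats, dim=-1) @ F.normalize(text_emb, dim=-1)

# Only N * inst_dim floats are stored with the scene, never N * lang_dim.
inst = torch.randn(100_000, 16)                    # per-Gaussian features
scores = open_vocab_query(CodebookCrossAttention()(inst), torch.randn(512))
```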
Related papers
- FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views [52.02871618456553]
FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. We propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images.
arXiv Detail & Related papers (2025-12-19T13:04:13Z)
- Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting [42.85503386524195]
We introduce Visibility-Aware Language Aggregation (VALA), a method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner (a minimal sketch of this gating idea appears after this list).
arXiv Detail & Related papers (2025-09-05T21:56:11Z)
- Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop [0.0]
2D vision-language models trained on web-scale image-text pairs exhibit rich semantic understanding and support open-vocabulary detection. We leverage the maturity and category diversity of 2D foundation models to perform 3D object detection without any human-annotated 3D labels. Our results highlight the untapped potential of 2D foundation models for scalable 3D perception.
arXiv Detail & Related papers (2025-07-06T15:00:13Z)
- Tackling View-Dependent Semantics in 3D Language Gaussian Splatting [80.88015191411714]
LaGa establishes cross-view semantic connections by decomposing the 3D scene into objects. It constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset.
arXiv Detail & Related papers (2025-05-30T16:06:32Z)
- PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding [8.72555461868951]
3D Gaussian Splatting (3DGS) has shown encouraging performance for open vocabulary scene understanding tasks. Previous methods cannot distinguish 3D instance-level information, as they usually predict only a heatmap between the scene features and a text query. We propose PanoGS, a novel and effective 3D panoptic open vocabulary scene understanding approach.
arXiv Detail & Related papers (2025-03-23T15:27:29Z)
- UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting [68.37013525040891]
We propose UniGS, which integrates 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We demonstrate the effectiveness of UniGS in learning a more general and better-aligned multi-modal representation.
arXiv Detail & Related papers (2025-02-25T05:10:22Z)
- OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies [112.80292725951921]
OVGaussian is a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR).
arXiv Detail & Related papers (2024-12-31T07:55:35Z)
- OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding [54.981605111365056]
This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding. Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing.
arXiv Detail & Related papers (2024-06-04T07:42:33Z)
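As flagged in the VALA entry above, here is a minimal sketch of a visibility-aware gate under standard alpha compositing. It treats a Gaussian's marginal contribution on a ray as w_i = alpha_i * prod_{j<i}(1 - alpha_j) and lifts the pixel's 2D language feature only onto Gaussians whose contribution clears a threshold. The function name and threshold value are illustrative assumptions, not VALA's actual implementation.

```python
# Sketch of visibility-aware 2D-to-3D feature lifting for one ray;
# details beyond the abstract's description are assumptions.
import torch

def lift_pixel_feature(alphas, pixel_feat, vis_thresh=0.05):
    # alphas: (M,) opacities of the Gaussians hit by one ray, sorted
    # front-to-back. pixel_feat: (D,) 2D language feature at that pixel.
    # Transmittance in front of Gaussian i: T_i = prod_{j<i} (1 - alpha_j).
    trans = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * trans                # marginal contribution w_i
    visible = weights > vis_thresh          # the visibility-aware gate
    # Only visible Gaussians receive the pixel feature, scaled by w_i;
    # Gaussians hidden behind opaque surfaces accumulate nothing, which
    # keeps the lifted 3D features view-consistent and memory-efficient.
    contrib = torch.zeros(alphas.numel(), pixel_feat.numel())
    contrib[visible] = weights[visible].unsqueeze(1) * pixel_feat.unsqueeze(0)
    return contrib

# Usage: a nearly opaque front Gaussian gates out everything behind it.
w = lift_pixel_feature(torch.tensor([0.9, 0.8, 0.5]), torch.randn(512))
```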