GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting
- URL: http://arxiv.org/abs/2508.14278v2
- Date: Thu, 21 Aug 2025 09:47:52 GMT
- Title: GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting
- Authors: Elena Alegret, Kunyi Li, Sen Wang, Siyun Liang, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari
- Abstract summary: We present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. It supports seamless 2D and 3D open-vocabulary queries and reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning.
- Score: 74.56128224977279
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D tasks.
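The listing contains no code, but the architecture described above lends itself to a short illustration. The sketch below shows the general pattern of cross-attention over two learnable codebooks: each Gaussian keeps only a low-dimensional instance feature, and high-dimensional language features are synthesized on the fly, which is where the memory saving comes from. All names, dimensions, and the key/value roles assigned to the two codebooks are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the idea described in the
# abstract: cross-attention from low-dimensional per-Gaussian instance
# features over two learnable codebooks yields high-dimensional
# language-aligned features on the fly, so they are never stored
# per Gaussian. Dimensions and the key/value split are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookCrossAttention(nn.Module):
    def __init__(self, inst_dim=16, lang_dim=512, codebook_size=128):
        super().__init__()
        # Two learnable codebooks of view-independent semantic embeddings:
        # one serves as attention keys, the other as the value table that
        # holds the high-dimensional language-aligned embeddings.
        self.keys = nn.Parameter(torch.randn(codebook_size, inst_dim))
        self.values = nn.Parameter(torch.randn(codebook_size, lang_dim))

    def forward(self, inst_feats):
        # inst_feats: (N, inst_dim) low-dim per-Gaussian instance features
        # (in GALA, distilled contrastively) act as the queries.
        logits = inst_feats @ self.keys.t() / inst_feats.shape[-1] ** 0.5
        attn = F.softmax(logits, dim=-1)           # (N, codebook_size)
        return attn @ self.values                  # (N, lang_dim), on the fly

def open_vocab_query(lang_feats, text_emb):
    # Cosine similarity against a text embedding (e.g. from a CLIP-style
    # encoder) scores every Gaussian for a 3D open-vocabulary query.
    return F.normalize(lang_feats, dim=-1) @ F.normalize(text_emb, dim=-1)

# Only N * inst_dim floats are stored with the scene, never N * lang_dim.
inst = torch.randn(100_000, 16)                    # per-Gaussian features
scores = open_vocab_query(CodebookCrossAttention()(inst), torch.randn(512))
```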
Related papers
- FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views [52.02871618456553]
FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. We propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images.
arXiv Detail & Related papers (2025-12-19T13:04:13Z)
- Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting [42.85503386524195]
We introduce Visibility-Aware Language Aggregation (VALA), a method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner (a minimal sketch of this gating idea appears after this list).
arXiv Detail & Related papers (2025-09-05T21:56:11Z)
- Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop [0.0]
2D vision-language models trained on web-scale image-text pairs exhibit rich semantic understanding and support open-vocabulary detection. We leverage the maturity and category diversity of 2D foundation models to perform 3D object detection without any human-annotated 3D labels. Our results highlight the untapped potential of 2D foundation models for scalable 3D perception.
arXiv Detail & Related papers (2025-07-06T15:00:13Z)
- Tackling View-Dependent Semantics in 3D Language Gaussian Splatting [80.88015191411714]
LaGa establishes cross-view semantic connections by decomposing the 3D scene into objects. It constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset.
arXiv Detail & Related papers (2025-05-30T16:06:32Z)
- PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding [8.72555461868951]
3D Gaussian Splatting (3DGS) has shown encouraging performance for open vocabulary scene understanding tasks. Previous methods cannot distinguish 3D instance-level information, as they usually predict only a heatmap between the scene features and a text query. We propose PanoGS, a novel and effective 3D panoptic open vocabulary scene understanding approach.
arXiv Detail & Related papers (2025-03-23T15:27:29Z)
- UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting [68.37013525040891]
We propose UniGS, which integrates 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We demonstrate the effectiveness of UniGS in learning a more general and better-aligned multi-modal representation.
arXiv Detail & Related papers (2025-02-25T05:10:22Z)
- OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies [112.80292725951921]
OVGaussian is a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR).
arXiv Detail & Related papers (2024-12-31T07:55:35Z)
- OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding [54.981605111365056]
This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding. Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing.
arXiv Detail & Related papers (2024-06-04T07:42:33Z)
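As flagged in the VALA entry above, here is a minimal sketch of a visibility-aware gate under standard alpha compositing. It treats a Gaussian's marginal contribution on a ray as w_i = alpha_i * prod_{j<i}(1 - alpha_j) and lifts the pixel's 2D language feature only onto Gaussians whose contribution clears a threshold. The function name and threshold value are illustrative assumptions, not VALA's actual implementation.

```python
# Sketch of visibility-aware 2D-to-3D feature lifting for one ray;
# details beyond the abstract's description are assumptions.
import torch

def lift_pixel_feature(alphas, pixel_feat, vis_thresh=0.05):
    # alphas: (M,) opacities of the Gaussians hit by one ray, sorted
    # front-to-back. pixel_feat: (D,) 2D language feature at that pixel.
    # Transmittance in front of Gaussian i: T_i = prod_{j<i} (1 - alpha_j).
    trans = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * trans                # marginal contribution w_i
    visible = weights > vis_thresh          # the visibility-aware gate
    # Only visible Gaussians receive the pixel feature, scaled by w_i;
    # Gaussians hidden behind opaque surfaces accumulate nothing, which
    # keeps the lifted 3D features view-consistent and memory-efficient.
    contrib = torch.zeros(alphas.numel(), pixel_feat.numel())
    contrib[visible] = weights[visible].unsqueeze(1) * pixel_feat.unsqueeze(0)
    return contrib

# Usage: a nearly opaque front Gaussian gates out everything behind it.
w = lift_pixel_feature(torch.tensor([0.9, 0.8, 0.5]), torch.randn(512))
```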