Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting
- URL: http://arxiv.org/abs/2509.05515v1
- Date: Fri, 05 Sep 2025 21:56:11 GMT
- Title: Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting
- Authors: Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, Stefano Gasperini
- Abstract summary: We introduce Visibility-Aware Language Aggregation (VALA), a method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner.
- Score: 42.85503386524195
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions of 3D scenes, we observe two fundamental issues: background Gaussians contributing negligibly to a rendered pixel get the same feature as the dominant foreground ones, and multi-view inconsistencies due to view-specific noise in language embeddings. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works.
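The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that a Gaussian's marginal contribution along a ray is the standard front-to-back alpha-compositing weight, that the gate keeps Gaussians whose weight is at least a fraction `tau` of the largest weight on the ray, and that the streaming median can be approximated by a one-pass Weiszfeld-style running average over unit-normalized features. The names `ray_contributions`, `visibility_gate`, `StreamingCosineMedian`, `tau`, and `eps` are all hypothetical.

```python
import math


def ray_contributions(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(a * transmittance)
        transmittance *= 1.0 - a
    return weights


def visibility_gate(weights, tau=0.5):
    """Hypothetical gate: keep indices whose weight is >= tau * max weight on the ray."""
    top = max(weights)
    return [i for i, w in enumerate(weights) if w >= tau * top]


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


class StreamingCosineMedian:
    """One-pass, Weiszfeld-style approximation of a weighted geometric median
    under cosine distance (an assumed heuristic, not the paper's exact update)."""

    def __init__(self, eps=1e-3):
        self.m = None      # current unit-norm estimate
        self.denom = 0.0   # running sum of reweighted sample weights
        self.eps = eps     # floor on the distance to avoid division by zero

    def update(self, feature, weight=1.0):
        f = _normalize(feature)
        if self.m is None:
            self.m, self.denom = f, weight
            return self.m
        # Cosine distance between the current estimate and the new sample.
        d = max(1.0 - sum(a * b for a, b in zip(self.m, f)), self.eps)
        g = weight / d     # distant (likely noisy) samples are down-weighted
        self.denom += g
        step = g / self.denom
        self.m = _normalize([(1.0 - step) * a + step * b for a, b in zip(self.m, f)])
        return self.m
```

In this sketch, a Gaussian hidden behind a nearly opaque one receives a small compositing weight and is dropped by the gate, so it would not inherit the foreground feature. The running median processes one view at a time and stores only the current estimate and a scalar, which is one way to read the "streaming" and "memory-efficient" claims. A single outlier view pulls the estimate far less than it would pull a weighted mean.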
Related papers
- FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views [52.02871618456553]
FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. We propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images.
arXiv Detail & Related papers (2025-12-19T13:04:13Z)
- C3G: Learning Compact 3D Representations with 2K Gaussians [55.04010158339562]
Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations.
arXiv Detail & Related papers (2025-12-03T17:59:05Z)
- OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion [89.98812408058336]
We introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin.
arXiv Detail & Related papers (2025-10-21T03:24:12Z)
- GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting [74.56128224977279]
We present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. It supports seamless 2D and 3D open-vocabulary queries and reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning.
arXiv Detail & Related papers (2025-08-19T21:26:49Z)
- GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond [56.677984098204696]
Multimodal language models are driving the development of 3D Vision-Language Models (VLMs). We propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. We present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images.
arXiv Detail & Related papers (2025-07-01T15:52:59Z)
- Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding [15.86865606131156]
We introduce MVOV3D, a novel approach aimed at unleashing the potential of 2D multi-view fusion for open-vocabulary 3D scene understanding. Specifically, MVOV3D improves multi-view 2D features by leveraging precise region-level image features and text features encoded by CLIP encoders. Our method achieves a new record with 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 for challenging open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2025-06-28T08:40:42Z)
- CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting [18.581169318975046]
3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, but cross-view granularity inconsistency is a problem. We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS. CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF-OVS and ScanNet.
arXiv Detail & Related papers (2025-04-16T09:20:03Z)
- SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians [77.77265204740037]
3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. We introduce SuperGSeg, a novel approach that fosters cohesive, context-aware scene representation. SuperGSeg outperforms prior works on both open-vocabulary object localization and semantic segmentation tasks.
arXiv Detail & Related papers (2024-12-13T16:01:19Z)
- Occam's LGS: An Efficient Approach for Language Gaussian Splatting [57.00354758206751]
We show that the complicated pipelines for language 3D Gaussian Splatting are simply unnecessary. We apply Occam's razor to the task at hand, leading to a highly efficient weighted multi-view feature aggregation technique.
arXiv Detail & Related papers (2024-12-02T18:50:37Z)
- Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding [2.517953665531978]
We introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Our representation achieves the best visual quality and language querying accuracy across current language-embedded representations.
arXiv Detail & Related papers (2023-11-30T11:50:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.