Related papers: Polysemous Language Gaussian Splatting via Matching-based Mask Lifting

Polysemous Language Gaussian Splatting via Matching-based Mask Lifting

URL: http://arxiv.org/abs/2509.22225v1
Date: Fri, 26 Sep 2025 11:38:05 GMT
Title: Polysemous Language Gaussian Splatting via Matching-based Mask Lifting
Authors: Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, Ge Li,
Abstract summary: MUSplat is a training-free framework that abandons feature optimization entirely.<n>Our pipeline generates and lifts multi-granularity 2D masks into 3D, where we estimate a foreground probability for each Gaussian point to form initial object groups.<n>We then optimize the ambiguous boundaries of these initial groups using semantic entropy and geometric opacity.
Score: 16.769952481766445
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. However, mainstream methods suffer from three key flaws: (i) their reliance on costly per-scene retraining prevents plug-and-play application; (ii) their restrictive monosemous design fails to represent complex, multi-concept semantics; and (iii) their vulnerability to cross-view semantic inconsistencies corrupts the final semantic representation. To overcome these limitations, we introduce MUSplat, a training-free framework that abandons feature optimization entirely. Leveraging a pre-trained 2D segmentation model, our pipeline generates and lifts multi-granularity 2D masks into 3D, where we estimate a foreground probability for each Gaussian point to form initial object groups. We then optimize the ambiguous boundaries of these initial groups using semantic entropy and geometric opacity. Subsequently, by interpreting the object's appearance across its most representative viewpoints, a Vision-Language Model (VLM) distills robust textual features that reconciles visual inconsistencies, enabling open-vocabulary querying via semantic matching. By eliminating the costly per-scene training process, MUSplat reduces scene adaptation time from hours to mere minutes. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, MUSplat outperforms established training-based frameworks while simultaneously addressing their monosemous limitations.

Related papers

SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction [5.730573889498275]
We propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework.<n>We first utilize semantic and uncertainty priors to suppress projections from free space during view transformation.<n>We then employ an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation.
arXiv Detail & Related papers (2026-01-16T16:07:38Z)
OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion [89.98812408058336]
We introduce textbfOpenInsGaussian, an textbfOpen-vocabulary textbfInstance textbfGaussian segmentation framework with Context-aware Cross-view Fusion.<n>OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin.
arXiv Detail & Related papers (2025-10-21T03:24:12Z)
BEEP3D: Box-Supervised End-to-End Pseudo-Mask Generation for 3D Instance Segmentation [28.97274092946373]
3D instance segmentation is crucial for understanding complex 3D environments, yet fully supervised methods require dense point-level annotations.<n>Box-level annotations inherently introduce ambiguity in overlapping regions, making accurate point-to-instance assignment challenging.<n>Recent methods address this ambiguity by generating pseudo-masks through training a dedicated pseudo-labeler in an additional training stage.<n>We propose BEEP3D-Box-supervised End-to-End Pseudo-mask generation for 3D instance segmentation.
arXiv Detail & Related papers (2025-10-14T06:23:18Z)
Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings [17.855913571198013]
We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely.<n>Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation.<n>This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction.
arXiv Detail & Related papers (2025-09-16T10:39:37Z)
Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model [51.02616473941499]
3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization.<n>However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures.<n>We present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds.
arXiv Detail & Related papers (2025-09-09T15:01:28Z)
UniForward: Unified 3D Scene and Semantic Field Reconstruction via Feed-Forward Gaussian Splatting from Only Sparse-View Images [43.40816438003861]
We propose a feed-forward model that unifies 3D scene and semantic field reconstruction.<n>Our UniForward can reconstruct 3D scenes and the corresponding semantic fields in real time from only sparse-view images.<n> Experiments on novel view synthesis and novel view segmentation demonstrate that our method achieves state-of-the-art performances.
arXiv Detail & Related papers (2025-06-11T04:01:21Z)
CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting [53.15827818829865]
Methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies.<n>We propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues.<n>Our framework explicitly resolves semantic conflicts while preserving category discriminability.
arXiv Detail & Related papers (2025-05-26T19:09:33Z)
CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting [18.581169318975046]
3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, but cross-view granularity inconsistency is a problem.<n>We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS.<n>CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF-OVS and ScanNet.
arXiv Detail & Related papers (2025-04-16T09:20:03Z)
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding [58.38294408121273]
We propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding dubbed CUA-O3D.<n>Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties.
arXiv Detail & Related papers (2025-03-20T20:58:48Z)
Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels.<n>FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z)
Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining [41.145598142457686]
LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications. We propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames. Our method consistently outperforms existing state-of-the-art pretraining frameworks across three major public autonomous driving datasets.
arXiv Detail & Related papers (2024-07-10T08:46:29Z)
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. We propose to leverage weakly supervised annotations to learn the 3D visual grounding model. We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-07-18T13:49:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.