Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes
- URL: http://arxiv.org/abs/2512.07807v1
- Date: Mon, 08 Dec 2025 18:39:58 GMT
- Title: Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes
- Authors: Shai Krakovsky, Gal Fiebelman, Sagie Benaim, Hadar Averbuch-Elor
- Abstract summary: Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments. We propose a novel approach to address challenges in semantic feature misalignment and inefficiency in memory and runtime. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
- Score: 23.445409551683213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based hash encoder. This significantly improves efficiency in both runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground-truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
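To make the bottleneck-plus-hash-encoder idea concrete, below is a minimal PyTorch sketch (not the authors' code) of the decoding path the abstract describes: very low-dimensional per-Gaussian features are rasterized to the screen, and the rendered values index a multi-resolution hash table, Instant-NGP style, before a small MLP lifts them to a CLIP-sized semantic space. The level count, table sizes, dimensions, and the `FeatureHashEncoder` name are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureHashEncoder(nn.Module):
    """Multi-resolution hash encoder keyed on rendered feature values."""
    def __init__(self, in_dim=3, levels=8, table_size=2**14, feats_per_level=2):
        super().__init__()
        self.levels, self.table_size = levels, table_size
        # One small learnable table per resolution level.
        self.tables = nn.Parameter(
            torch.randn(levels, table_size, feats_per_level) * 1e-2)
        # Large primes mix the quantized coordinates into a hash index.
        self.register_buffer(
            "primes", torch.tensor([1, 2654435761, 805459861][:in_dim]))
        self.out_dim = levels * feats_per_level

    def forward(self, x):                      # x: (N, in_dim), roughly in [0, 1]
        out = []
        for lvl in range(self.levels):
            res = 2 ** (lvl + 2)               # coarse-to-fine quantization
            idx = (x * res).long()             # bucket the feature values
            h = (idx * self.primes).sum(-1) % self.table_size
            out.append(self.tables[lvl][h])    # (N, feats_per_level)
        return torch.cat(out, dim=-1)          # (N, out_dim)

# Rendered 3-D bottleneck features -> 512-D language features.
encoder = FeatureHashEncoder(in_dim=3)
decoder = nn.Sequential(nn.Linear(encoder.out_dim, 64), nn.ReLU(),
                        nn.Linear(64, 512))
rendered = torch.rand(4096, 3)                 # stand-in for rasterizer output
semantics = decoder(encoder(rendered))
print(semantics.shape)                         # torch.Size([4096, 512])
```

The point of the bottleneck is that only the tiny `in_dim`-channel features live on the (many) Gaussians, while the hash tables and MLP are shared across the scene, which is where the claimed memory and runtime savings would come from.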
Related papers
- FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views [52.02871618456553]
FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. We propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images.
arXiv Detail & Related papers (2025-12-19T13:04:13Z)
- SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields [33.113865514268085]
Holistic 3D scene understanding is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes. We propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method.
arXiv Detail & Related papers (2025-06-11T09:56:39Z)
- Tackling View-Dependent Semantics in 3D Language Gaussian Splatting [80.88015191411714]
LaGa establishes cross-view semantic connections by decomposing the 3D scene into objects. It constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset.
arXiv Detail & Related papers (2025-05-30T16:06:32Z)
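The clustering-and-reweighting step in the LaGa summary above can be pictured with a small sketch. This illustrates the general idea only, not the paper's implementation: the k-means setup, the view-support weighting, and every name are assumptions.

```python
import torch
import torch.nn.functional as F

def view_aggregated_semantics(descriptors, view_ids, k=4, iters=10):
    """descriptors: (N, D) L2-normalized per-view features of one object;
    view_ids: (N,) which view each descriptor came from."""
    # Plain k-means over the descriptors (cosine-similarity assignment).
    centers = descriptors[torch.randperm(len(descriptors))[:k]].clone()
    for _ in range(iters):
        assign = (descriptors @ centers.T).argmax(dim=1)
        for c in range(k):
            members = descriptors[assign == c]
            if len(members):
                centers[c] = F.normalize(members.mean(0), dim=0)
    # Reweight each cluster by its multi-view support: the fraction of
    # distinct views contributing at least one descriptor to it.
    n_views = int(view_ids.max()) + 1
    support = torch.tensor([
        torch.unique(view_ids[assign == c]).numel() / n_views
        for c in range(k)])
    weights = support / support.sum()
    # One aggregated, cross-view-consistent semantic descriptor.
    return F.normalize((weights[:, None] * centers).sum(0), dim=0)

descs = F.normalize(torch.randn(256, 512), dim=1)     # e.g., CLIP patch features
views = torch.randint(0, 8, (256,))
print(view_aggregated_semantics(descs, views).shape)  # torch.Size([512])
```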
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene. Existing approaches commonly encounter a shortage of text-3D pairs available for training. We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
- LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding [42.750252190275546]
LangSurf is a Language-Embedded Surface Field that aligns 3D language fields with the surface of objects. Our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing.
arXiv Detail & Related papers (2024-12-23T15:12:20Z)
- Occam's LGS: An Efficient Approach for Language Gaussian Splatting [57.00354758206751]
We show that the complicated pipelines for language 3D Gaussian Splatting are simply unnecessary. We apply Occam's razor to the task at hand, leading to a highly efficient weighted multi-view feature aggregation technique.
arXiv Detail & Related papers (2024-12-02T18:50:37Z)
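A weighted multi-view aggregation of the kind the Occam's LGS summary mentions can be written in a few lines. The sketch below is one plausible reading (weight the 2D features by each Gaussian's rendering weight and average), not the paper's exact scheme; every name in it is an assumption.

```python
import torch

def aggregate_features(feats_2d, gauss_ids, render_weights, num_gaussians, dim):
    """
    feats_2d: (M, D) 2D features sampled at pixels across all views.
    gauss_ids: (M,) index of the Gaussian dominating each sample.
    render_weights: (M,) that Gaussian's alpha-blending weight at the pixel.
    Returns per-Gaussian language features as the weight-normalized average.
    """
    acc = torch.zeros(num_gaussians, dim)
    wsum = torch.zeros(num_gaussians)
    acc.index_add_(0, gauss_ids, feats_2d * render_weights[:, None])
    wsum.index_add_(0, gauss_ids, render_weights)
    return acc / wsum.clamp_min(1e-8)[:, None]

feats = torch.randn(10_000, 512)          # e.g., CLIP features at pixels
ids = torch.randint(0, 5_000, (10_000,))  # which Gaussian each pixel hits
w = torch.rand(10_000)                    # blending weights from the rasterizer
per_gaussian = aggregate_features(feats, ids, w, 5_000, 512)
print(per_gaussian.shape)                 # torch.Size([5000, 512])
```

The appeal of such a scheme is that it needs no per-scene optimization of the language features at all, which is presumably where the claimed efficiency comes from.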
- GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane [53.388937705785025]
3D open-vocabulary scene understanding is crucial for advancing augmented reality and robotic applications.
We introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS).
Our method treats the feature selection process as a hyperplane division within the feature space, retaining only features that are highly relevant to the query.
arXiv Detail & Related papers (2024-05-27T18:57:18Z)
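The hyperplane view of querying is easy to illustrate: a query defines a normal direction in feature space, and everything on its positive side is kept. In GOI the hyperplane is additionally optimized; the sketch below only initializes it from a text embedding, and the function names and the bias value are assumptions.

```python
import torch
import torch.nn.functional as F

def select_gaussians(features, text_embedding, bias=0.1):
    """features: (N, D) per-Gaussian semantics; text_embedding: (D,) query."""
    w = F.normalize(text_embedding, dim=0)   # hyperplane normal from the query
    feats = F.normalize(features, dim=1)
    scores = feats @ w - bias                # signed distance to the hyperplane
    return scores > 0                        # boolean mask of retained Gaussians

feats = F.normalize(torch.randn(5_000, 512), dim=1)
query = torch.randn(512)                     # stand-in for a CLIP text vector
mask = select_gaussians(feats, query)
print(int(mask.sum()), "Gaussians retained")
```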
- Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting [27.974762304763694]
We introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting.
Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features into a novel semantic component of 3D Gaussians.
We build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference.
arXiv Detail & Related papers (2024-03-22T21:28:19Z)
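The fast-inference half of this summary, a network that predicts the semantic component directly from raw Gaussian parameters, could look something like the point-wise stand-in below. The real network may well use spatial context (e.g., sparse convolutions), so treat the architecture, the 14-dimensional input layout, and all names as assumptions.

```python
import torch
import torch.nn as nn

class GaussianSemanticHead(nn.Module):
    """Per-Gaussian MLP: raw 3DGS parameters -> semantic feature."""
    def __init__(self, sem_dim=512):
        super().__init__()
        # Assumed raw inputs: position (3) + scale (3) + rotation quaternion (4)
        # + opacity (1) + SH DC color (3) = 14 dims per Gaussian.
        self.mlp = nn.Sequential(
            nn.Linear(14, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, sem_dim))

    def forward(self, params):         # params: (N, 14)
        return self.mlp(params)

head = GaussianSemanticHead()
params = torch.randn(5_000, 14)        # stand-in for a trained 3DGS scene
print(head(params).shape)              # torch.Size([5000, 512])
```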
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z)
- Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding [2.517953665531978]
We introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks.
Our representation achieves the best visual quality and language querying accuracy across current language-embedded representations.
arXiv Detail & Related papers (2023-11-30T11:50:07Z)
- ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z)
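Online semantic reconstruction of the kind ALSTER's summary describes can be caricatured as incremental label voting in a voxel grid: each RGB-D frame is back-projected into world space, and its per-pixel labels update per-voxel class counts. This is a toy sketch of the general setting, not ALSTER's expert-based method; every constant and name is an assumption.

```python
import numpy as np

VOXEL = 0.05                 # assumed 5 cm voxels
vox_counts = {}              # voxel index -> per-class vote counts

def integrate_frame(depth, labels, K, pose, num_classes=20):
    """depth: (H, W) in meters; labels: (H, W) class ids;
    K: 3x3 intrinsics; pose: 4x4 camera-to-world. Updates the map in place."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]          # back-project each pixel
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts = np.stack([x, y, depth, np.ones_like(depth)], -1).reshape(-1, 4)
    world = (pose @ pts.T).T[:, :3]
    keys = np.floor(world / VOXEL).astype(np.int64)
    for key, cls in zip(map(tuple, keys), labels.reshape(-1)):
        vox_counts.setdefault(key, np.zeros(num_classes, np.int32))[cls] += 1

def current_label(key):
    """Majority-vote label for a voxel, refined as more frames arrive."""
    return int(vox_counts[key].argmax())

depth = np.full((4, 4), 1.0)
labels = np.random.randint(0, 20, (4, 4))
K = np.array([[100., 0., 2.], [0., 100., 2.], [0., 0., 1.]])
integrate_frame(depth, labels, K, np.eye(4))
print(len(vox_counts), "voxels mapped")
```

Because the map is updated one frame at a time, such a scheme is directly usable under the real-time constraints the summary mentions, unlike offline batch reconstruction.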
This list is automatically generated from the titles and abstracts of the papers in this site.