Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
- URL: http://arxiv.org/abs/2311.18482v1
- Date: Thu, 30 Nov 2023 11:50:07 GMT
- Title: Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
- Authors: Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, Shao-Hua Guan
- Abstract summary: We introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks.
Our representation achieves the best visual quality and language querying accuracy across current language-embedded representations.
- Score: 2.517953665531978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary querying in 3D space is challenging but essential for scene
understanding tasks such as object localization and segmentation.
Language-embedded scene representations have made progress by incorporating
language features into 3D spaces. However, their efficacy heavily depends on
neural networks that are resource-intensive in training and rendering. Although
recent 3D Gaussians offer efficient and high-quality novel view synthesis,
directly embedding language features in them leads to prohibitive memory usage
and decreased performance. In this work, we introduce Language Embedded 3D
Gaussians, a novel scene representation for open-vocabulary query tasks.
Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we
propose a dedicated quantization scheme that drastically alleviates the memory
requirement, and a novel embedding procedure that achieves smoother yet high
accuracy query, countering the multi-view feature inconsistencies and the
high-frequency inductive bias in point-based representations. Our comprehensive
experiments show that our representation achieves the best visual quality and
language querying accuracy across current language-embedded representations,
while maintaining real-time rendering frame rates on a single desktop GPU.
Related papers
- LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding [42.750252190275546]
LangSurf is a language-Embedded Surface Field that aligns 3D language fields with the surface of objects.
Our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing.
arXiv Detail & Related papers (2024-12-23T15:12:20Z) - SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians [77.77265204740037]
3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering.
We introduce SuperGSeg, a novel approach that fosters cohesive, context-aware scene representation.
SuperGSeg outperforms prior works on both open-vocabulary object localization and semantic segmentation tasks.
arXiv Detail & Related papers (2024-12-13T16:01:19Z) - Occam's LGS: A Simple Approach for Language Gaussian Splatting [57.00354758206751]
We show that sophisticated techniques for language-grounded 3D Gaussian Splatting are simply unnecessary.
We apply Occam's razor to the task at hand and perform weighted multi-view feature aggregation.
Our results offer us state-of-the-art results with a speed-up of two orders of magnitude.
arXiv Detail & Related papers (2024-12-02T18:50:37Z) - g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks [62.74304008688472]
Generalizable 3D-Language Feature Fields (g3D-LF) is a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks.
arXiv Detail & Related papers (2024-11-26T01:54:52Z) - GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane [53.388937705785025]
3D open-vocabulary scene understanding is crucial for advancing augmented reality and robotic applications.
We introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS)
Our method treats the feature selection process as a hyperplane division within the feature space, retaining only features that are highly relevant to the query.
arXiv Detail & Related papers (2024-05-27T18:57:18Z) - Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure natural and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z) - Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting [27.974762304763694]
We introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting.
Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features into a novel semantic component of 3D Gaussians.
We build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference.
arXiv Detail & Related papers (2024-03-22T21:28:19Z) - HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z) - Self-supervised 3D Semantic Representation Learning for
Vision-and-Language Navigation [30.429893959096752]
We develop a novel training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation.
We construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs.
Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset.
arXiv Detail & Related papers (2022-01-26T07:43:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.