LangSplat: 3D Language Gaussian Splatting
- URL: http://arxiv.org/abs/2312.16084v2
- Date: Sun, 31 Mar 2024 04:45:58 GMT
- Title: LangSplat: 3D Language Gaussian Splatting
- Authors: Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, Hanspeter Pfister
- Abstract summary: LangSplat constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces.
LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin.
- Score: 42.16849512832556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans live in a 3D world and commonly use natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat advances the field by utilizing a collection of 3D Gaussians, each encoding language features distilled from CLIP, to represent the language field. By employing a tile-based splatting technique for rendering language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods struggle with imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM, thereby eliminating the need to extensively query the language field across various scales and to regularize DINO features. Extensive experimental results show that LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin. Notably, LangSplat is extremely efficient, achieving a 199 $\times$ speedup over LERF at a resolution of 1440 $\times$ 1080. We strongly recommend that readers check out our video results at https://langsplat.github.io/
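The abstract's main efficiency idea is the scene-wise language autoencoder: rather than attaching a full 512-dimensional CLIP embedding to every Gaussian, features are compressed into a low-dimensional scene-specific latent space and decoded back to CLIP space only when answering a query. The sketch below illustrates that idea; the layer widths, the 3-dimensional latent, the cosine reconstruction loss, and the `relevancy` helper are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneLanguageAutoencoder(nn.Module):
    """Compresses CLIP features into a small scene-specific latent (a sketch; sizes assumed)."""
    def __init__(self, clip_dim: int = 512, latent_dim: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(clip_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, clip_dim),
        )

    def forward(self, clip_feats: torch.Tensor):
        latent = self.encoder(clip_feats)   # the low-dimensional feature a Gaussian would carry
        recon = self.decoder(latent)        # back to CLIP space for open-vocabulary querying
        return latent, recon


def train_autoencoder(clip_feats: torch.Tensor, epochs: int = 100) -> SceneLanguageAutoencoder:
    """Fits the autoencoder on CLIP features extracted from one scene's training views."""
    model = SceneLanguageAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        _, recon = model(clip_feats)
        # Cosine reconstruction loss keeps decoded features usable for CLIP-style similarity.
        loss = 1.0 - F.cosine_similarity(recon, clip_feats, dim=-1).mean()
        loss.backward()
        opt.step()
    return model


def relevancy(decoded_feats: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between decoded per-pixel features and a CLIP text query embedding."""
    return F.normalize(decoded_feats, dim=-1) @ F.normalize(text_embed, dim=-1)
```

At query time, the low-dimensional features splatted to the screen would be decoded with the scene's decoder and scored against the CLIP text embedding of the prompt.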
Related papers
- GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane [53.388937705785025]
3D open-vocabulary scene understanding is crucial for advancing augmented reality and robotic applications.
We introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS).
Our method treats the feature selection process as a hyperplane division within the feature space, retaining only features that are highly relevant to the query.
arXiv Detail & Related papers (2024-05-27T18:57:18Z)
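The GOI entry above frames open-vocabulary selection as a hyperplane division of the semantic feature space. A rough, hedged reading of that idea is sketched below: the query embedding initializes the hyperplane normal, and only per-Gaussian features on the positive side are retained. GOI additionally optimizes the hyperplane; the bias term and the feature shapes here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_gaussians_by_hyperplane(
    gaussian_feats: torch.Tensor,   # (N, D) semantic feature per 3D Gaussian
    query_embed: torch.Tensor,      # (D,) embedding of the text query
    bias: float = 0.0,              # hyperplane offset; an assumed, tunable value
) -> torch.Tensor:
    """Keeps Gaussians on the positive side of a query-derived hyperplane (a sketch)."""
    normal = F.normalize(query_embed, dim=-1)        # hyperplane normal from the query
    feats = F.normalize(gaussian_feats, dim=-1)
    scores = feats @ normal + bias                   # signed distance to the hyperplane
    return torch.nonzero(scores > 0, as_tuple=False).squeeze(-1)  # indices of retained Gaussians
```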
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- Transcrib3D: 3D Referring Expression Resolution through Large Language Models [28.121606686759225]
We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models.
Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks.
We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions.
arXiv Detail & Related papers (2024-04-30T02:48:20Z)
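The Transcrib3D recipe above pairs off-the-shelf 3D detection with LLM reasoning. A minimal, hedged sketch of the "transcription" step is shown below: detections are rendered as text so a language model can resolve the referring expression. The prompt format and the `Box3D` fields are assumptions, not the paper's exact protocol.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    label: str
    center: tuple[float, float, float]   # metres, scene coordinates
    size: tuple[float, float, float]

def boxes_to_prompt(boxes: list[Box3D], query: str) -> str:
    """Transcribes detector output into text so an LLM can pick the referred object (a sketch)."""
    lines = [
        f"object {i}: {b.label}, center={b.center}, size={b.size}"
        for i, b in enumerate(boxes)
    ]
    return (
        "Here are the detected objects in a 3D scene:\n"
        + "\n".join(lines)
        + f"\n\nWhich object index does this description refer to? \"{query}\"\n"
        + "Answer with a single index."
    )
```

The resulting prompt would be passed to whatever LLM backend is available, and the returned index selects the grounded box.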
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), inspired by restricting embedding entries to the language of interest, to bolster time and memory efficiency.
We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
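The two trimming heuristics named in the entry above are easy to picture in code. Below is a hedged sketch that filters a tokenizer vocabulary (e.g., the token-to-id dict a Hugging Face tokenizer returns from `get_vocab()`) either by Unicode script or by occurrence in a target-language corpus; the `unicodedata`-based script test, the whitespace tokenization of the corpus, and the handling of subword markers are simplifications rather than the paper's procedure.

```python
import unicodedata
from collections import Counter

def trim_by_script(vocab: dict[str, int], script: str = "LATIN") -> dict[str, int]:
    """Keeps tokens whose alphabetic characters all belong to the target Unicode script."""
    def in_script(token: str) -> bool:
        letters = [ch for ch in token if ch.isalpha()]
        return all(script in unicodedata.name(ch, "") for ch in letters)
    return {tok: idx for tok, idx in vocab.items() if in_script(tok)}

def trim_by_corpus(vocab: dict[str, int], corpus: list[str], min_count: int = 1) -> dict[str, int]:
    """Keeps tokens that occur at least min_count times in a target-language corpus."""
    counts = Counter(tok for line in corpus for tok in line.split())
    return {tok: idx for tok, idx in vocab.items() if counts[tok] >= min_count}
```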
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a description in natural language.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- LERF: Language Embedded Radiance Fields [35.925752853115476]
Language Embedded Radiance Fields (LERF) is a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF.
LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays.
After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real-time.
arXiv Detail & Related papers (2023-03-16T17:59:20Z)
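The LERF entry above mentions extracting 3D relevancy maps for arbitrary prompts. One common way to score such maps, described in the LERF paper, is a pairwise softmax of the query against a few canonical negative phrases; a hedged sketch follows. The negative-phrase set and the assumption that all embeddings are already L2-normalized CLIP features are illustrative.

```python
import torch

def relevancy_map(
    lang_embeds: torch.Tensor,   # (H, W, D) rendered language embeddings, assumed L2-normalized
    query_embed: torch.Tensor,   # (D,) CLIP text embedding of the prompt, L2-normalized
    neg_embeds: torch.Tensor,    # (K, D) embeddings of canonical phrases such as "object", "stuff"
) -> torch.Tensor:
    """Pairwise-softmax relevancy: how strongly each pixel prefers the query over every negative."""
    pos = lang_embeds @ query_embed              # (H, W) similarity to the query
    neg = lang_embeds @ neg_embeds.T             # (H, W, K) similarity to each negative phrase
    # Probability the query wins against each negative, then take the worst case over negatives.
    pair = torch.exp(pos.unsqueeze(-1)) / (torch.exp(pos.unsqueeze(-1)) + torch.exp(neg))
    return pair.min(dim=-1).values               # (H, W) relevancy in [0, 1]
```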
- PLA: Language-Driven Open-Vocabulary 3D Scene Understanding [57.47315482494805]
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
The recent breakthrough in 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts.
We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D.
arXiv Detail & Related papers (2022-11-29T15:52:22Z)
- Looking Outside the Box to Ground Language in 3D Scenes [27.126171549887232]
We propose a model for grounding language in 3D scenes with three main innovations.
Iterative attention across the language stream, the point cloud feature stream and 3D box proposals.
Joint supervision from 3D object annotations and language grounding annotations.
When applied to language grounding on 2D images with minor changes, it performs on par with the state of the art while converging in half the GPU time.
arXiv Detail & Related papers (2021-12-16T13:50:23Z)
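The "iterative attention" innovation in the last entry can be pictured as repeated cross-attention between a language stream, a point-cloud stream, and a set of 3D box proposals. The block below is a loose, hedged sketch of one such round; the feature dimension, head count, update order, and use of `nn.MultiheadAttention` are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class IterativeGroundingBlock(nn.Module):
    """One round of cross-stream attention between word, point, and box features (a sketch)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.words_to_scene = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.boxes_to_all = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words: torch.Tensor, points: torch.Tensor, boxes: torch.Tensor):
        # words: (B, L, D) language tokens; points: (B, P, D) point features;
        # boxes: (B, K, D) 3D box proposal features.
        scene = torch.cat([points, boxes], dim=1)
        words, _ = self.words_to_scene(words, scene, scene)    # words attend to the 3D scene
        context = torch.cat([words, points], dim=1)
        boxes, _ = self.boxes_to_all(boxes, context, context)  # boxes attend to words + points
        return words, boxes
```

Stacking a few such blocks and scoring each box feature against a pooled language feature would yield per-box grounding scores.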
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.