LERF: Language Embedded Radiance Fields
- URL: http://arxiv.org/abs/2303.09553v1
- Date: Thu, 16 Mar 2023 17:59:20 GMT
- Title: LERF: Language Embedded Radiance Fields
- Authors: Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, Matthew
Tancik
- Abstract summary: Language Embedded Radiance Fields (LERFs) is a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF.
LERFs learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays.
After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real-time.
- Score: 35.925752853115476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans describe the physical world using natural language to refer to
specific 3D locations based on a vast range of properties: visual appearance,
semantics, abstract associations, or actionable affordances. In this work we
propose Language Embedded Radiance Fields (LERFs), a method for grounding
language embeddings from off-the-shelf models like CLIP into NeRF, which enable
these types of open-ended language queries in 3D. LERF learns a dense,
multi-scale language field inside NeRF by volume rendering CLIP embeddings
along training rays, supervising these embeddings across training views to
provide multi-view consistency and smooth the underlying language field. After
optimization, LERF can extract 3D relevancy maps for a broad range of language
prompts interactively in real-time, which has potential use cases in robotics,
understanding vision-language models, and interacting with 3D scenes. LERF
enables pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings
without relying on region proposals or masks, supporting long-tail
open-vocabulary queries hierarchically across the volume. The project website
can be found at https://lerf.io .
Related papers
- AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene.
Existing approaches commonly encounter a shortage of text3D pairs available for training.
We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z) - ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding [9.289977174410824]
3D visual grounding involves localizing entities in a 3D scene referred to by natural language text.
We introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns.
arXiv Detail & Related papers (2025-01-02T17:20:41Z) - ChatSplat: 3D Conversational Gaussian Splatting [51.40403199909113]
ChatSplat is a system that constructs a 3D language field, enabling rich chat-based interaction within 3D space.
For view-level interaction, we designed an encoder that encodes the rendered feature map of each view into tokens, which are then processed by a large language model.
At the scene level, ChatSplat combines multi-view tokens, enabling interactions that consider the entire scene.
arXiv Detail & Related papers (2024-12-01T08:59:30Z) - g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks [62.74304008688472]
Generalizable 3D-Language Feature Fields (g3D-LF) is a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks.
arXiv Detail & Related papers (2024-11-26T01:54:52Z) - Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure natural and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z) - LangSplat: 3D Language Gaussian Splatting [42.16849512832556]
LangSplat constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces.
LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin.
arXiv Detail & Related papers (2023-12-26T15:14:37Z) - LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language
Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z) - Towards Language-guided Interactive 3D Generation: LLMs as Layout
Interpreter with Generative Feedback [20.151147653552155]
Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities.
We propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter.
Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content.
arXiv Detail & Related papers (2023-05-25T07:43:39Z) - RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding [46.253711788685536]
We introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models.
We devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning.
Our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% for semantic and instance segmentation.
arXiv Detail & Related papers (2023-04-03T13:30:04Z) - LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.