LERF: Language Embedded Radiance Fields
- URL: http://arxiv.org/abs/2303.09553v1
- Date: Thu, 16 Mar 2023 17:59:20 GMT
- Title: LERF: Language Embedded Radiance Fields
- Authors: Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, Matthew Tancik
- Abstract summary: Language Embedded Radiance Fields (LERF) is a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF.
LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays.
After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real-time.
- Score: 35.925752853115476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans describe the physical world using natural language to refer to
specific 3D locations based on a vast range of properties: visual appearance,
semantics, abstract associations, or actionable affordances. In this work we
propose Language Embedded Radiance Fields (LERFs), a method for grounding
language embeddings from off-the-shelf models like CLIP into NeRF, which enables
these types of open-ended language queries in 3D. LERF learns a dense,
multi-scale language field inside NeRF by volume rendering CLIP embeddings
along training rays, supervising these embeddings across training views to
provide multi-view consistency and smooth the underlying language field. After
optimization, LERF can extract 3D relevancy maps for a broad range of language
prompts interactively in real-time, which has potential use cases in robotics,
understanding vision-language models, and interacting with 3D scenes. LERF
enables pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings
without relying on region proposals or masks, supporting long-tail
open-vocabulary queries hierarchically across the volume. The project website
can be found at https://lerf.io .
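As a rough illustration of the two steps the abstract describes, the sketch below volume-renders per-sample language features into a single ray embedding using standard NeRF compositing weights and then scores that embedding against a CLIP text query. This is a minimal NumPy sketch under assumed shapes, using random stand-in features, not the authors' implementation; the actual system predicts multi-scale CLIP features from a learned language field and computes relevancy against a set of canonical negative phrases.

```python
import numpy as np

def composite_weights(sigmas, deltas):
    # Standard NeRF alpha-compositing weights for the samples along one ray.
    # sigmas: (S,) volume densities; deltas: (S,) spacing between samples.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]
    return transmittance * alphas  # (S,)

def render_language_embedding(lang_feats, sigmas, deltas):
    # Volume-render per-sample language features (S, D) into one ray embedding,
    # then renormalize onto the unit sphere like a CLIP embedding.
    w = composite_weights(sigmas, deltas)
    emb = (w[:, None] * lang_feats).sum(axis=0)
    return emb / (np.linalg.norm(emb) + 1e-8)

def relevancy(ray_emb, query_emb, negative_embs):
    # Score the rendered embedding for a text query against generic negative
    # phrase embeddings via a pairwise softmax; keep the worst case over negatives.
    pos = ray_emb @ query_emb
    negs = negative_embs @ ray_emb
    pairwise = np.exp(pos) / (np.exp(pos) + np.exp(negs))
    return pairwise.min()

# Toy usage with random stand-ins for field outputs and CLIP text embeddings.
rng = np.random.default_rng(0)
S, D = 64, 512
feats = rng.normal(size=(S, D))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sigmas, deltas = rng.uniform(0.0, 5.0, S), np.full(S, 0.01)
query = rng.normal(size=D); query /= np.linalg.norm(query)
negatives = rng.normal(size=(3, D))
negatives /= np.linalg.norm(negatives, axis=1, keepdims=True)
print(relevancy(render_language_embedding(feats, sigmas, deltas), query, negatives))
```

Repeating this per pixel ray yields the kind of relevancy map the abstract refers to; in practice the features come from the optimized language field rather than random data.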
Related papers
- PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model [4.079327215055764]
Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world.
Visual Language Models (VLMs) have excelled in high-level reasoning but fall short in grasping the nuanced physical properties required for effective human-robot interaction.
We introduce PAVLM, an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point clouds.
arXiv Detail & Related papers (2024-10-15T12:53:42Z)
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- LangSplat: 3D Language Gaussian Splatting [42.16849512832556]
LangSplat constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces.
LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin.
arXiv Detail & Related papers (2023-12-26T15:14:37Z)
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback [20.151147653552155]
Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities.
We propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter.
Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content.
arXiv Detail & Related papers (2023-05-25T07:43:39Z)
- RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding [46.253711788685536]
We introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models.
We devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning.
Our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% on semantic and instance segmentation, respectively.
arXiv Detail & Related papers (2023-04-03T13:30:04Z)
- PLA: Language-Driven Open-Vocabulary 3D Scene Understanding [57.47315482494805]
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
Recent breakthrough of 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts.
We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D.
arXiv Detail & Related papers (2022-11-29T15:52:22Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)