ChatSplat: 3D Conversational Gaussian Splatting
- URL: http://arxiv.org/abs/2412.00734v1
- Date: Sun, 01 Dec 2024 08:59:30 GMT
- Title: ChatSplat: 3D Conversational Gaussian Splatting
- Authors: Hanlin Chen, Fangyin Wei, Gim Hee Lee
- Abstract summary: ChatSplat is a system that constructs a 3D language field, enabling rich chat-based interaction within 3D space.
For view-level interaction, we designed an encoder that encodes the rendered feature map of each view into tokens, which are then processed by a large language model.
At the scene level, ChatSplat combines multi-view tokens, enabling interactions that consider the entire scene.
- Score: 51.40403199909113
- Abstract: Humans naturally interact with their 3D surroundings using language, and modeling 3D language fields for scene understanding and interaction has gained growing interest. This paper introduces ChatSplat, a system that constructs a 3D language field, enabling rich chat-based interaction within 3D space. Unlike existing methods that primarily use CLIP-derived language features focused solely on segmentation, ChatSplat facilitates interaction on three levels: objects, views, and the entire 3D scene. For view-level interaction, we designed an encoder that encodes the rendered feature map of each view into tokens, which are then processed by a large language model (LLM) for conversation. At the scene level, ChatSplat combines multi-view tokens, enabling interactions that consider the entire scene. For object-level interaction, ChatSplat uses a patch-wise language embedding, unlike LangSplat's pixel-wise language embedding that implicitly includes mask and embedding. Here, we explicitly decouple the language embedding into separate mask and feature map representations, allowing more flexible object-level interaction. To address the challenge of learning 3D Gaussians posed by the complex and diverse distribution of language embeddings used in the LLM, we introduce a learnable normalization technique to standardize these embeddings, facilitating effective learning. Extensive experimental results demonstrate that ChatSplat supports multi-level interactions -- object, view, and scene -- within 3D space, enhancing both understanding and engagement.
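The three mechanisms named in the abstract (learnable normalization of LLM embeddings, a language embedding decoupled into a mask and a feature map, and scene-level combination of multi-view tokens) can be illustrated with a minimal sketch. The abstract does not specify any of these at the implementation level, so everything below is an assumption made for illustration: the per-channel scale-and-shift form of the normalization, the mask-weighted averaging, the plain concatenation of view tokens, and all function names (`normalize`, `denormalize`, `masked_object_feature`, `scene_tokens`) are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def normalize(embedding: Vector, scale: Vector, shift: Vector) -> Vector:
    """Standardize an embedding per channel: (x - shift) / scale.

    `scale` and `shift` stand in for learnable parameters (an assumption;
    the abstract does not give the exact form of the normalization).
    """
    return [(x - b) / s for x, s, b in zip(embedding, scale, shift)]


def denormalize(normalized: Vector, scale: Vector, shift: Vector) -> Vector:
    """Invert `normalize` to map a feature back to the original embedding space."""
    return [z * s + b for z, s, b in zip(normalized, scale, shift)]


def masked_object_feature(
    feature_map: List[List[Vector]], mask: List[List[float]]
) -> Vector:
    """Average the patch features selected by a separate object mask.

    Keeping the mask apart from the feature map is the decoupling idea;
    mask-weighted averaging is an illustrative choice, not the paper's.
    """
    dim = len(feature_map[0][0])
    total = [0.0] * dim
    weight = 0.0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, m in zip(feat_row, mask_row):
            weight += m
            for i in range(dim):
                total[i] += m * feat[i]
    if weight == 0.0:
        raise ValueError("mask selects no patches")
    return [t / weight for t in total]


def scene_tokens(view_tokens: List[List[Vector]]) -> List[Vector]:
    """Concatenate per-view token sequences into one scene-level sequence."""
    return [token for view in view_tokens for token in view]


# Toy 2x2 patch grid with 2-channel features and a mask picking two patches.
feature_map = [[[1.0, 2.0], [3.0, 4.0]],
               [[5.0, 6.0], [7.0, 8.0]]]
mask = [[1.0, 0.0],
        [0.0, 1.0]]
object_feature = masked_object_feature(feature_map, mask)
standardized = normalize(object_feature, scale=[2.0, 2.0], shift=[1.0, 1.0])
combined = scene_tokens([[[0.0, 1.0]], [[2.0, 3.0], [4.0, 5.0]]])
```

In this toy setup the same feature map can serve several objects by swapping the mask, which is the flexibility the abstract attributes to the decoupled representation.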
Related papers
- 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding [0.5755004576310334]
A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them.
In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph.
The learnable representation is used as input for LLMs to perform 3D vision-language tasks.
arXiv Detail & Related papers (2024-12-24T14:21:58Z)
- LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding [42.750252190275546]
LangSurf is a Language-Embedded Surface Field that aligns 3D language fields with the surface of objects.
Our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing.
arXiv Detail & Related papers (2024-12-23T15:12:20Z)
- ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives (groups of visual tokens).
Our method unifies the prompt and answer of visual referential tasks without using additional syntax.
ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- LangSplat: 3D Language Gaussian Splatting [42.16849512832556]
LangSplat constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces.
LangSplat outperforms the previous state-of-the-art method, LERF, by a large margin.
arXiv Detail & Related papers (2023-12-26T15:14:37Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural-language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z)
- LERF: Language Embedded Radiance Fields [35.925752853115476]
Language Embedded Radiance Fields (LERF) is a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF.
LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays.
After optimization, LERF can interactively extract 3D relevancy maps for a broad range of language prompts in real time.
arXiv Detail & Related papers (2023-03-16T17:59:20Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
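Several entries above (LangSplat, LERF, OpenScene) describe open-vocabulary querying of features that share an embedding space with text. A minimal sketch of that idea follows, assuming plain cosine similarity between a per-point feature and a text embedding. The toy 2-D vectors stand in for real CLIP embeddings, the function names are hypothetical, and the actual scoring used by each of those methods may differ from this simplification.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_map(features: List[List[Vector]], query: Vector) -> List[List[float]]:
    """Score every feature in a 2-D grid against one text-query embedding."""
    return [[cosine_similarity(f, query) for f in row] for row in features]


def best_label(feature: Vector, text_embeddings: Dict[str, Vector]) -> str:
    """Return the label whose text embedding is most similar to the feature."""
    return max(
        text_embeddings,
        key=lambda name: cosine_similarity(feature, text_embeddings[name]),
    )


# Toy stand-ins for text embeddings of two open-vocabulary labels.
labels = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
grid = [[[1.0, 0.0], [0.0, 2.0]]]
scores = relevancy_map(grid, labels["chair"])
predicted = best_label([0.9, 0.1], labels)
```

Because the label set is only a dictionary of text embeddings, new queries can be added at inference time without retraining, which is what makes this style of querying open-vocabulary.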