SplatTalk: 3D VQA with Gaussian Splatting
- URL: http://arxiv.org/abs/2503.06271v1
- Date: Sat, 08 Mar 2025 16:31:48 GMT
- Title: SplatTalk: 3D VQA with Gaussian Splatting
- Authors: Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, Thomas Funkhouser,
- Abstract summary: Language-guided 3D scene understanding is important for advancing applications in robotics, AR/VR, and human-computer interaction. We introduce SplatTalk, a novel method that uses a generalizable 3D Gaussian Splatting (3DGS) framework to produce 3D tokens suitable for direct input into a pretrained LLM.
- Score: 13.211810095081159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language-guided 3D scene understanding is important for advancing applications in robotics, AR/VR, and human-computer interaction, enabling models to comprehend and interact with 3D environments through natural language. While 2D vision-language models (VLMs) have achieved remarkable success in 2D VQA tasks, progress in the 3D domain has been significantly slower due to the complexity of 3D data and the high cost of manual annotations. In this work, we introduce SplatTalk, a novel method that uses a generalizable 3D Gaussian Splatting (3DGS) framework to produce 3D tokens suitable for direct input into a pretrained LLM, enabling effective zero-shot 3D visual question answering (3D VQA) for scenes with only posed images. In experiments on multiple benchmarks, our approach outperforms both 3D models trained specifically for the task and previous 2D-LMM-based models utilizing only images (our setting), while achieving competitive performance with state-of-the-art 3D LMMs that additionally utilize 3D inputs.
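The sketch below illustrates the pipeline described in the abstract: posed images are lifted into a generalizable 3DGS feature field, pooled into a compact set of 3D tokens, and prepended to the question embeddings of a pretrained, frozen LLM. This is a minimal illustration under assumptions; all module names, shapes, and the projection design are hypothetical placeholders, not the authors' released implementation.

```python
# Illustrative SplatTalk-style pipeline (assumed structure, not the official code):
# posed RGB images -> generalizable 3DGS feature field -> 3D tokens -> pretrained LLM.
import torch
import torch.nn as nn


class GaussianFeatureEncoder(nn.Module):
    """Hypothetical stand-in for a generalizable 3DGS backbone that lifts
    posed images into per-Gaussian latent features."""

    def __init__(self, feat_dim: int = 256, num_gaussians: int = 4096):
        super().__init__()
        self.num_gaussians = num_gaussians
        self.proj = nn.Linear(3 * 32 * 32, feat_dim)  # toy per-view image encoder

    def forward(self, images: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # images: (V, 3, H, W) posed views; poses: (V, 4, 4) camera-to-world matrices.
        patches = nn.functional.adaptive_avg_pool2d(images, (32, 32)).flatten(1)
        feats = self.proj(patches)  # (V, feat_dim) per-view features
        # Broadcast aggregated view features to a fixed set of Gaussians
        # (placeholder for the feed-forward Gaussian prediction in the paper).
        return feats.mean(0, keepdim=True).repeat(self.num_gaussians, 1)


class TokenProjector(nn.Module):
    """Maps per-Gaussian features into the LLM embedding space and pools them
    into a short sequence of 3D tokens."""

    def __init__(self, feat_dim: int = 256, llm_dim: int = 4096, num_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, gaussian_feats: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(gaussian_feats.T).T  # (num_tokens, feat_dim)
        return self.proj(pooled)                # (num_tokens, llm_dim)


def answer_question(images, poses, question_embeds, llm):
    """Zero-shot 3D VQA: prepend 3D scene tokens to the question embeddings
    and run the (frozen) pretrained LLM."""
    encoder, projector = GaussianFeatureEncoder(), TokenProjector()
    tokens_3d = projector(encoder(images, poses))         # (num_tokens, llm_dim)
    inputs = torch.cat([tokens_3d, question_embeds], 0)   # scene tokens, then question
    return llm(inputs.unsqueeze(0))
```

The key design point the abstract emphasizes is that the 3D tokens are produced from posed images alone, so no point clouds or 3D annotations are needed at inference time.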
Related papers
- Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.
UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z) - UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting [68.37013525040891]
We propose UniGS, integrating 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We demonstrate the effectiveness of UniGS in learning a more general and stronger aligned multi-modal representation.
arXiv Detail & Related papers (2025-02-25T05:10:22Z) - 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding [49.15555885075644]
We develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs. We introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes.
arXiv Detail & Related papers (2025-01-14T03:50:23Z) - Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding [19.382210260928776]
Video-3D LLM treats 3D scenes as dynamic videos and incorporates 3D position encoding into these representations. Our model achieves state-of-the-art performance on several 3D scene understanding benchmarks.
arXiv Detail & Related papers (2024-11-30T14:28:53Z) - Transcrib3D: 3D Referring Expression Resolution through Large Language Models [28.121606686759225]
We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models.
Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks.
We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions.
arXiv Detail & Related papers (2024-04-30T02:48:20Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z) - Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)