Related papers: Grounded 3D-LLM with Referent Tokens

Grounded 3D-LLM with Referent Tokens

URL: http://arxiv.org/abs/2405.10370v1
Date: Thu, 16 May 2024 18:03:41 GMT
Title: Grounded 3D-LLM with Referent Tokens
Authors: Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang,
Abstract summary: We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes. It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates.
Score: 58.890058568493096
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data. It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates. To facilitate the use of referent tokens in subsequent language modeling, we have curated large-scale grounded language datasets that offer finer scene-text correspondence at the phrase level by bootstrapping existing object labels. Subsequently, we introduced Contrastive LAnguage-Scene Pre-training (CLASP) to effectively leverage this data, thereby integrating 3D vision with language models. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D QA, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and the broad applicability of Grounded 3D-LLM. Code and datasets will be released on the project page: https://groundedscenellm.github.io/grounded_3d-llm.github.io.

Related papers

Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding [5.568166420745467]
3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. We introduce a data augmentation framework-Imputer, and use it to curate a new benchmark dataset-ImputeRefer for 3D-ERU. We also propose Ges3ViG, a novel model for 3D-ERU that achieves 30% improvement in accuracy as compared to other 3D-ERU models and 9% compared to other purely language-based 3D grounding models.
arXiv Detail & Related papers (2025-04-13T15:43:06Z)
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes [10.139461308573336]
IRef-VLA is the largest real-world dataset for the referential grounding task consisting of over 11.5K scanned 3D rooms. We aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems.
arXiv Detail & Related papers (2025-03-20T16:16:10Z)
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding [9.289977174410824]
3D visual grounding involves localizing entities in a 3D scene referred to by natural language text. We introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns.
arXiv Detail & Related papers (2025-01-02T17:20:41Z)
g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks [62.74304008688472]
Generalizable 3D-Language Feature Fields (g3D-LF) is a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks.
arXiv Detail & Related papers (2024-11-26T01:54:52Z)
Transcrib3D: 3D Referring Expression Resolution through Large Language Models [28.121606686759225]
We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models. Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks. We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions.
arXiv Detail & Related papers (2024-04-30T02:48:20Z)
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks.
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT. This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information. We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z)
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.