Related papers: Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

URL: http://arxiv.org/abs/2308.08769v1
Date: Thu, 17 Aug 2023 03:52:15 GMT
Title: Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes
Authors: Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao
Abstract summary: 3D scene understanding has gained significant attention due to its wide range of applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs.
Score: 56.727745047799246
License: http://creativecommons.org/licenses/by/4.0/
Abstract: 3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue systems for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. Our experiments show that Chat-3D achieves an impressive ability to comprehend diverse instructions for 3D scenes, engage in intricate spatial reasoning, and incorporate external knowledge into its responses. Chat-3D achieves a 75.6% relative score compared with GPT-4 on the constructed instruction dataset.

Related papers

Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models [12.545622346725544]
New era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks.<n>We propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks.<n>We introduce two novel tasks: 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model's spatial awareness capabilities.
arXiv Detail & Related papers (2025-07-22T12:32:35Z)
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches.<n>Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited comparatively lower performance compared with the latest 3D object-centric and 2D image-based approaches.<n>Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models [45.28780381341979]
We introduce a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks. We also propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module.
arXiv Detail & Related papers (2024-10-04T19:22:20Z)
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness [22.408933972095763]
We introduce a simple yet effective framework called LLaVA-3D. It efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets.
arXiv Detail & Related papers (2024-09-26T17:59:11Z)
Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework addressing the 3D scene understanding. We propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding. A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints.
arXiv Detail & Related papers (2024-03-18T14:47:03Z)
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning [24.162598399141785]
Scene-LLM is a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning.
arXiv Detail & Related papers (2024-03-18T01:18:48Z)
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT. This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information. We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z)
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning [42.61001274381612]
We present LL3DA, a Large Language 3D Assistant that takes point cloud as direct input and respond to both textual-instructions and visual-prompts. Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Captioning and 3D Question Answering.
arXiv Detail & Related papers (2023-11-30T16:00:23Z)
3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore. We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations. We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings. We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)
CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$2$) to learn the transferable 3D point cloud representation in realistic scenarios. Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.