3D-LLM: Injecting the 3D World into Large Language Models
- URL: http://arxiv.org/abs/2307.12981v1
- Date: Mon, 24 Jul 2023 17:59:02 GMT
- Title: 3D-LLM: Injecting the 3D World into Large Language Models
- Authors: Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du,
Zhenfang Chen, Chuang Gan
- Abstract summary: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
- Score: 60.43823088804661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) and Vision-Language Models (VLMs) have been
proven to excel at multiple tasks, such as commonsense reasoning. Powerful as
these models can be, they are not grounded in the 3D physical world, which
involves richer concepts such as spatial relationships, affordances, physics,
layout, and so on. In this work, we propose to inject the 3D world into large
language models and introduce a whole new family of 3D-LLMs. Specifically,
3D-LLMs can take 3D point clouds and their features as input and perform a
diverse set of 3D-related tasks, including captioning, dense captioning, 3D
question answering, task decomposition, 3D grounding, 3D-assisted dialog,
navigation, and so on. Using three types of prompting mechanisms that we
design, we are able to collect over 300k 3D-language data pairs covering these tasks.
To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that
obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as
our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism,
3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show
that our model outperforms state-of-the-art baselines by a large margin (e.g.,
the BLEU-1 score surpasses the state-of-the-art score by 9%). Furthermore,
experiments on our held-in datasets for 3D captioning, task decomposition, and
3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative
examples also show that our model could perform more tasks beyond the scope of
existing LLMs and VLMs. Project Page: https://vis-www.cs.umass.edu/3dllm/.
Related papers
- 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination [22.029496025779405]
3D-GRAND is a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions.
Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs.
As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs.
arXiv Detail & Related papers (2024-06-07T17:59:59Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts [30.571811801090224]
We introduce a comprehensive 3D instruction-following dataset called M3DBench.
It supports general multimodal instructions interleaved with text, images, 3D objects, and other visual prompts.
It unifies diverse 3D tasks at both region and scene levels, covering a variety of fundamental abilities in real-world 3D environments.
arXiv Detail & Related papers (2023-12-17T16:53:30Z)
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning [42.61001274381612]
We present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts.
Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Captioning and 3D Question Answering.
arXiv Detail & Related papers (2023-11-30T16:00:23Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model that explores unified 3D representation at scale.
Uni3D uses a 2D-initialized ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following [88.39360296377589]
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video.
We also present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions.
arXiv Detail & Related papers (2023-09-01T17:59:47Z)
- Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perception ability of pre-trained 3D representations with the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z)