Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D
Understanding, Generation, and Instruction Following
- URL: http://arxiv.org/abs/2309.00615v1
- Date: Fri, 1 Sep 2023 17:59:47 GMT
- Title: Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D
Understanding, Generation, and Instruction Following
- Authors: Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma,
Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
- Abstract summary: We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video.
We also present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions.
- Score: 88.39360296377589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Point-Bind, a 3D multi-modality model aligning point clouds with
2D image, language, audio, and video. Guided by ImageBind, we construct a joint
embedding space between 3D and multi-modalities, enabling many promising
applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D
open-world understanding. On top of this, we further present Point-LLM, the
first 3D large language model (LLM) following 3D multi-modal instructions. By
parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of
Point-Bind into pre-trained LLMs, e.g., LLaMA; it requires no 3D instruction
data, yet exhibits superior 3D and multi-modal question-answering capacity. We
hope our work sheds light on extending 3D point clouds to multi-modality
applications. Code is available at
https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
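To make the core alignment idea concrete, below is a minimal PyTorch sketch of training a 3D encoder against a frozen ImageBind-style joint embedding space with a contrastive loss, followed by the kind of embedding arithmetic the abstract mentions. All names here (Point3DEncoder, alignment_loss, the stand-in audio embeddings) are hypothetical illustrations, not Point-Bind's actual implementation.

```python
# Minimal sketch (not Point-Bind's actual code) of aligning a 3D encoder
# with a frozen ImageBind-style joint embedding space, plus the resulting
# 3D embedding arithmetic. All names are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Point3DEncoder(nn.Module):
    """Toy 3D encoder: per-point MLP, max-pool, projection into the
    joint embedding dimension used by the frozen multi-modal encoder."""
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 256), nn.GELU(),
                                       nn.Linear(256, 512))
        self.proj = nn.Linear(512, embed_dim)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> (B, embed_dim), L2-normalized
        feats = self.point_mlp(points).max(dim=1).values
        return F.normalize(self.proj(feats), dim=-1)

def alignment_loss(pc_emb: torch.Tensor, anchor_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: pull each 3D embedding toward the frozen
    embedding of its paired image/text/audio, push apart the rest."""
    logits = pc_emb @ anchor_emb.t() / temperature
    targets = torch.arange(pc_emb.size(0), device=pc_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Once aligned, embeddings from different modalities compose directly,
# which is what the paper calls 3D embedding arithmetic.
encoder = Point3DEncoder()
pc_emb = encoder(torch.randn(4, 2048, 3))              # 3D side
audio_emb = F.normalize(torch.randn(4, 1024), dim=-1)  # stand-in for frozen ImageBind audio embeddings
query = F.normalize(pc_emb + audio_emb, dim=-1)        # composed cross-modal retrieval query
```

Per the abstract, Point-LLM then bridges this joint space into a pre-trained LLM (e.g., LLaMA) through parameter-efficient fine-tuning, answering 3D multi-modal questions without any 3D instruction data.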
Related papers
- Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D.
Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D.
We show that pure data scaling yields strong 3D perception capability without any 3D-specific architectural design or training objective.
arXiv Detail & Related papers (2024-05-06T17:57:27Z)
- ShapeLLM: Universal 3D Object Understanding for Embodied Interaction [37.0434133128805]
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction.
ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++.
ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet.
arXiv Detail & Related papers (2024-02-27T18:57:12Z) - M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts [30.571811801090224]
We introduce a comprehensive 3D instruction-following dataset called M3DBench.
It supports general multimodal instructions interleaved with text, images, 3D objects, and other visual prompts.
It unifies diverse 3D tasks at both region and scene levels, covering a variety of fundamental abilities in real-world 3D environments.
arXiv Detail & Related papers (2023-12-17T16:53:30Z) - GPT4Point: A Unified Framework for Point-Language Understanding and
Generation [76.61439685940272]
GPT4Point is a groundbreaking point-language multimodal model for unified 3D object understanding and generation within the MLLM framework.
As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A.
It can also produce high-quality results from low-quality point-text features while preserving geometric shapes and colors.
arXiv Detail & Related papers (2023-12-05T18:59:55Z) - LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding,
Reasoning, and Planning [42.61001274381612]
We present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts.
Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Captioning and 3D Question Answering.
arXiv Detail & Related papers (2023-11-30T16:00:23Z) - Point Cloud Self-supervised Learning via 3D to Multi-view Masked
Autoencoder [21.73287941143304]
Multi-modality masked autoencoder (MAE) methods leverage both 2D images and 3D point clouds for pre-training.
We introduce a novel approach employing a 3D to multi-view masked autoencoder to fully harness the multi-modal attributes of 3D point clouds (a toy sketch of this idea appears after this list).
Our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks.
arXiv Detail & Related papers (2023-11-17T22:10:03Z)
- Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perception of pre-trained 3D representations with the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [96.95120198412395]
We introduce a tri-modal pre-training framework that automatically generates holistic language descriptions for 3D shapes.
It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets.
We conduct experiments on two large-scale 3D datasets, Objaverse and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, and language for training.
Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with fine-tuning, and 3D captioning (3D-to-language generation).
arXiv Detail & Related papers (2023-05-14T23:14:09Z)
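As a rough illustration of the 3D-to-multi-view masked autoencoder mentioned in the self-supervised learning entry above: the toy sketch below masks point patches, encodes the visible ones, and predicts multi-view targets. Every class name, hyperparameter, and target choice here is a hypothetical stand-in under assumed shapes, not the paper's actual architecture.

```python
# Toy sketch of a 3D-to-multi-view masked autoencoder (hypothetical names;
# the cited paper reconstructs multi-view projections of the point cloud).
import torch
import torch.nn as nn

class PointMAE3DToMultiView(nn.Module):
    def __init__(self, n_patches=64, patch_dim=96, d_model=256,
                 n_views=4, view_dim=32 * 32):
        super().__init__()
        self.mask_ratio = 0.6
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One lightweight head per target view (e.g., rendered depth maps).
        self.view_heads = nn.ModuleList(
            nn.Linear(d_model, view_dim) for _ in range(n_views)
        )

    def forward(self, patches: torch.Tensor):
        # patches: (B, n_patches, patch_dim) point patches (e.g., via FPS + kNN)
        B, P, _ = patches.shape
        keep = int(P * (1 - self.mask_ratio))
        # Randomly keep a subset of patches; the rest are masked out.
        idx = torch.rand(B, P, device=patches.device).argsort(dim=1)[:, :keep]
        visible = torch.gather(
            patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        )
        tokens = self.encoder(self.patch_embed(visible))
        pooled = tokens.mean(dim=1)
        # Predict each rendered view from the pooled latent of visible patches.
        return [head(pooled) for head in self.view_heads]

model = PointMAE3DToMultiView()
preds = model(torch.randn(2, 64, 96))
# Training would regress preds against depth/RGB views rendered from the
# full (unmasked) point cloud, e.g., with a per-view MSE loss.
```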