Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D
Understanding, Generation, and Instruction Following
- URL: http://arxiv.org/abs/2309.00615v1
- Date: Fri, 1 Sep 2023 17:59:47 GMT
- Title: Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D
Understanding, Generation, and Instruction Following
- Authors: Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma,
Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
- Abstract summary: We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video.
We also present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions.
- Score: 88.39360296377589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Point-Bind, a 3D multi-modality model aligning point clouds with
2D image, language, audio, and video. Guided by ImageBind, we construct a joint
embedding space between 3D and multi-modalities, enabling many promising
applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D
open-world understanding. On top of this, we further present Point-LLM, the
first 3D large language model (LLM) following 3D multi-modal instructions. By
parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of
Point-Bind into pre-trained LLMs, e.g., LLaMA, which requires no 3D instruction
data, but exhibits superior 3D and multi-modal question-answering capacity. We
hope our work may cast a light on the community for extending 3D point clouds
to multi-modality applications. Code is available at
https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
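The abstract mentions a joint embedding space and "3D embedding arithmetic". As a rough illustration only, the sketch below shows how arithmetic and retrieval could work in any shared embedding space: two unit-normalized embeddings are summed, renormalized, and matched against a gallery by cosine similarity. The toy vectors, labels, and the function names `normalize`, `cosine`, `compose`, and `retrieve` are all hypothetical stand-ins; none of this comes from the Point-Bind code or its encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def compose(a, b):
    """Embedding arithmetic: add two unit-normalized embeddings, then renormalize."""
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])


def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))


# Hypothetical toy embeddings; a real system would obtain these from
# modality-specific encoders that share one embedding space.
point_cloud_embedding = [1.0, 0.0, 0.0]  # stand-in for a 3D shape
audio_embedding = [0.0, 1.0, 0.0]        # stand-in for a sound clip

gallery = {
    "shape only": [0.9, 0.1, 0.1],
    "sound only": [0.1, 0.9, 0.1],
    "shape and sound": [0.7, 0.7, 0.1],
}

query = compose(point_cloud_embedding, audio_embedding)
best = retrieve(query, gallery)
print(best)
```

Summing normalized embeddings is one simple way to combine semantics across modalities; it only makes sense when both inputs already live in the same aligned space, which is the property the abstract attributes to Point-Bind.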
Related papers
- 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding [49.15555885075644]
We develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs.
We introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes.
arXiv Detail & Related papers (2025-01-14T03:50:23Z)
- 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer [33.42183318484381]
We introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world.
At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities.
arXiv Detail & Related papers (2025-01-02T09:33:13Z)
- Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D.
Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D.
We show that pure data scaling yields strong 3D perception capability without a 3D-specific architectural design or training objective.
arXiv Detail & Related papers (2024-05-06T17:57:27Z)
- ShapeLLM: Universal 3D Object Understanding for Embodied Interaction [37.0434133128805]
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction.
ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++.
ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet.
arXiv Detail & Related papers (2024-02-27T18:57:12Z)
- GPT4Point: A Unified Framework for Point-Language Understanding and Generation [76.61439685940272]
GPT4Point is a groundbreaking point-language multimodal model for unified 3D object understanding and generation within the MLLM framework.
As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A.
It can produce high-quality results from a low-quality point-text feature while maintaining geometric shapes and colors.
arXiv Detail & Related papers (2023-12-05T18:59:55Z)
- Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder [21.73287941143304]
Multi-Modality Masked AutoEncoders (MAE) methods leverage both 2D images and 3D point clouds for pre-training.
We introduce a novel approach employing a 3D to multi-view masked autoencoder to fully harness the multi-modal attributes of 3D point clouds.
Our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks.
arXiv Detail & Related papers (2023-11-17T22:10:03Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [96.95120198412395]
We introduce a tri-modal pre-training framework that automatically generates holistic language descriptions for 3D shapes.
It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets.
We conduct experiments on two large-scale 3D datasets, Objaverse and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, and language for training.
Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with finetuning, and 3D (3D
arXiv Detail & Related papers (2023-05-14T23:14:09Z)