ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
- URL: http://arxiv.org/abs/2305.08275v4
- Date: Fri, 26 Apr 2024 00:26:44 GMT
- Title: ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
- Authors: Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese,
- Abstract summary: We introduce tri-modal pre-training framework that automatically generates holistic language descriptions for 3D shapes.
It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets.
We conduct experiments on two large-scale 3D datasets, NN and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, captioning, and language for training.
Experiments show that NN-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with finetuning, and 3D (3D
- Score: 96.95120198412395
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes, their 2D counterparts, and language descriptions. However, the methods used by existing frameworks to curate such multimodal data, in particular language descriptions for 3D shapes, are not scalable, and the collected language descriptions are not diverse. To address this, we introduce ULIP-2, a simple yet effective tri-modal pre-training framework that leverages large multimodal models to automatically generate holistic language descriptions for 3D shapes. It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets. ULIP-2 is also equipped with scaled-up backbones for better multimodal representation learning. We conduct experiments on two large-scale 3D datasets, Objaverse and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, and language for training ULIP-2. Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with fine-tuning, and 3D captioning (3D-to-language generation). It achieves a new SOTA of 50.6% (top-1) on Objaverse-LVIS and 84.7% (top-1) on ModelNet40 in zero-shot classification. In the ScanObjectNN benchmark for standard fine-tuning, ULIP-2 reaches an overall accuracy of 91.5% with a compact model of only 1.4 million parameters. ULIP-2 sheds light on a new paradigm for scalable multimodal 3D representation learning without human annotations and shows significant improvements over existing baselines. The code and datasets are released at https://github.com/salesforce/ULIP.
Related papers
- Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D.
Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D.
We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective.
arXiv Detail & Related papers (2024-05-06T17:57:27Z) - Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels [69.55622471172941]
Large-scale vision 2D vision language models, such as CLIP can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models.
We propose an optimization framework: Cross-MoST: Cross-Modal Self-Training, to improve the label-free classification performance of a zero-shot 3D vision model.
arXiv Detail & Related papers (2024-04-15T21:30:50Z) - TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal
Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - Beyond First Impressions: Integrating Joint Multi-modal Cues for
Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z) - ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z) - Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.