ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding
- URL: http://arxiv.org/abs/2212.05171v4
- Date: Mon, 12 Jun 2023 19:30:52 GMT
- Title: ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding
- Authors: Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu,
Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese
- Abstract summary: Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
- Score: 110.07170245531464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recognition capabilities of current state-of-the-art 3D models are
limited by datasets with a small amount of annotated data and a pre-defined set
of categories. In its 2D counterpart, recent advances have shown that similar
problems can be significantly alleviated by employing knowledge from other
modalities, such as language. Inspired by this, leveraging multimodal
information for 3D modality could be promising to improve 3D understanding
under the restricted data regime, but this line of research is not well
studied. Therefore, we introduce ULIP to learn a unified representation of
images, texts, and 3D point clouds by pre-training with object triplets from
the three modalities. To overcome the shortage of training triplets, ULIP
leverages a pre-trained vision-language model that has already learned a common
visual and textual space by training with massive image-text pairs. Then, ULIP
learns a 3D representation space aligned with the common image-text space,
using a small number of automatically synthesized triplets. ULIP is agnostic to
3D backbone networks and can easily be integrated into any 3D architecture.
Experiments show that ULIP effectively improves the performance of multiple
recent 3D backbones by simply pre-training them on ShapeNet55 using our
framework, achieving state-of-the-art performance in both standard 3D
classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN.
ULIP also improves the performance of PointMLP by around 3% in 3D
classification on ScanObjectNN, and outperforms PointCLIP by 28.8% on top-1
accuracy for zero-shot 3D classification on ModelNet40. Our code and
pre-trained models are released at https://github.com/salesforce/ULIP.
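
To make the pre-training recipe concrete, below is a minimal sketch of ULIP-style tri-modal alignment in PyTorch. It assumes OpenAI's CLIP package as the frozen vision-language model; the `PointEncoder` class, `contrastive_loss` helper, and `training_step` loop are hypothetical stand-ins for illustration, not the released ULIP code, which supports multiple 3D backbones and trains on automatically synthesized ShapeNet55 triplets.

```python
# Minimal sketch of ULIP-style tri-modal alignment (illustrative, not the official code):
# a trainable 3D encoder is pulled into the frozen image/text embedding space of a
# pre-trained CLIP model using contrastive losses over (point cloud, image, text) triplets.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

clip_model, _ = clip.load("ViT-B/32", device="cpu")  # frozen vision-language model
for p in clip_model.parameters():
    p.requires_grad_(False)

class PointEncoder(nn.Module):
    """Hypothetical stand-in for any 3D backbone; ULIP is backbone-agnostic."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, 512), nn.ReLU())
        self.proj = nn.Linear(512, embed_dim)  # project into CLIP's embedding space

    def forward(self, xyz):                     # xyz: (B, N, 3)
        feat = self.mlp(xyz).max(dim=1).values  # global max pooling over points
        return self.proj(feat)

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

point_encoder = PointEncoder()
optimizer = torch.optim.AdamW(point_encoder.parameters(), lr=1e-4)

def training_step(points, images, texts):
    """points: (B, N, 3); images: CLIP-preprocessed (B, 3, 224, 224); texts: list[str]."""
    with torch.no_grad():  # the image-text space stays fixed
        img_emb = clip_model.encode_image(images).float()
        txt_emb = clip_model.encode_text(clip.tokenize(texts)).float()
    pc_emb = point_encoder(points)
    # Align the 3D embedding with both the image and the text embedding.
    loss = contrastive_loss(pc_emb, img_emb) + contrastive_loss(pc_emb, txt_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Zero-shot 3D classification then follows the usual CLIP recipe: encode a text prompt for each candidate category with the frozen text encoder and predict the category whose embedding has the highest cosine similarity with the point-cloud embedding from the aligned 3D encoder.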
Related papers
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z)
- GS-CLIP: Gaussian Splatting for Contrastive Language-Image-3D Pretraining from Real-World Data [73.06536202251915]
3D shapes represented as point clouds have seen advances in multimodal pre-training that aligns them with image and language descriptions.
We propose GS-CLIP, the first attempt to introduce 3D Gaussian Splatting (3DGS) into multimodal pre-training to enhance 3D representations.
arXiv Detail & Related papers (2024-02-09T05:46:47Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
This insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
- ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [96.95120198412395]
We introduce a tri-modal pre-training framework that automatically generates holistic language descriptions for 3D shapes.
It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets.
We conduct experiments on two large-scale 3D datasets, Objaverse and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, and language for training.
Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with fine-tuning, and 3D captioning (3D-to-language generation).
arXiv Detail & Related papers (2023-05-14T23:14:09Z)
- CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [55.864132158596206]
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
arXiv Detail & Related papers (2023-01-12T10:42:39Z)