CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data
- URL: http://arxiv.org/abs/2303.12417v2
- Date: Sun, 26 Mar 2023 11:55:40 GMT
- Title: CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data
- Authors: Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye,
Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, Hang Xu
- Abstract summary: We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
- Score: 80.42480679542697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training, which benefits from
large-scale unlabeled text-image pairs, has demonstrated strong performance on
open-world vision understanding tasks. However, due to the scarcity of text-3D
data pairs, adapting the success of 2D Vision-Language Models (VLMs) to the 3D
space remains an open problem. Existing works that leverage VLMs for 3D
understanding generally resort to constructing intermediate 2D representations
of the 3D data, at the cost of losing 3D geometry information. To take a step
toward open-world 3D vision understanding, we propose Contrastive
Language-Image-Point Cloud Pretraining (CLIP$^2$), which directly learns
transferable 3D point cloud representations in realistic scenarios via a novel
proxy alignment mechanism. Specifically, we exploit naturally existing
correspondences between 2D and 3D scenarios and build well-aligned,
instance-based text-image-point proxies from those complex scenes. On top of
that, we propose a cross-modal contrastive objective to learn semantically and
instance-level aligned point cloud representations. Experimental results on
both indoor and outdoor scenarios show that the learned 3D representation
transfers well to downstream tasks, including zero-shot and few-shot 3D
recognition, outperforming state-of-the-art methods by large margins.
Furthermore, we analyze the capability of different representations in real
scenarios and present an optional ensemble scheme.
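The cross-modal contrastive objective pairs each point cloud proxy with its corresponding image and text proxies. Below is a minimal sketch of such a three-way, InfoNCE-style loss; the embedding dimension, temperature, and equal weighting of the three modality pairs are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: symmetric InfoNCE over aligned text-image-point proxies.
# Assumes each modality has already been encoded into (N, D) embeddings for
# N matched instances; the encoders themselves are not shown here.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of aligned embeddings (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def cross_modal_loss(point_emb, image_emb, text_emb):
    """Sum of pairwise contrastive terms over the three modalities (equal weights assumed)."""
    return (info_nce(point_emb, text_emb) +
            info_nce(point_emb, image_emb) +
            info_nce(image_emb, text_emb))

# Toy usage with random proxy embeddings for N instances.
N, D = 8, 512
loss = cross_modal_loss(torch.randn(N, D), torch.randn(N, D), torch.randn(N, D))
print(loss.item())
```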
Related papers
- Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling [9.440800948514449]
We propose a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling.
Our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images.
We design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes.
arXiv Detail & Related papers (2024-04-03T07:30:09Z)
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models to align the semantics between texts and 2D images.
During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
- Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
arXiv Detail & Related papers (2023-11-03T06:05:36Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses an end-to-end pretrained 2D ViT to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations.
We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings.
We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)
- Joint Representation Learning for Text and 3D Point Cloud [35.67281936143821]
We propose a novel Text4Point framework to construct language-guided 3D point cloud models.
The proposed Text4Point follows the pre-training and fine-tuning paradigm.
Our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection.
arXiv Detail & Related papers (2023-01-18T15:02:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.