PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
- URL: http://arxiv.org/abs/2211.11682v2
- Date: Sat, 26 Aug 2023 16:14:09 GMT
- Title: PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
- Authors: Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng
Qin, Shanghang Zhang, Peng Gao
- Abstract summary: We are the first to make CLIP and GPT collaborate as a unified 3D open-world learner, named PointCLIP V2.
PointCLIP V2 fully unleashes their potential for zero-shot 3D classification, segmentation, and detection.
Our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification.
- Score: 40.28152121477885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-trained models have shown promising open-world performance
for both vision and language tasks. However, their transfer capability on 3D
point clouds is still limited and constrained to the classification task.
In this paper, we are the first to make CLIP and GPT collaborate as a unified
3D open-world learner, named PointCLIP V2, which fully unleashes their
potential for zero-shot 3D classification, segmentation, and detection. To
better align 3D data with the pre-trained language knowledge, PointCLIP V2
contains two key designs. On the visual end, we prompt CLIP via a shape
projection module to generate more realistic depth maps, narrowing the domain
gap between projected point clouds and natural images. On the textual end, we
prompt the GPT model to generate 3D-specific text as the input to CLIP's
textual encoder. Without any training in 3D domains, our approach
significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on
three datasets for zero-shot 3D classification. On top of that, V2 can be
extended to few-shot 3D classification, zero-shot 3D part segmentation, and 3D
object detection in a simple manner, demonstrating its generalization ability
for unified 3D open-world learning.
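As a rough illustration of the pipeline the abstract describes, the sketch below projects a point cloud into multi-view depth maps, encodes them with CLIP's image encoder, and matches the averaged visual feature against text features of per-class prompts. It assumes the OpenAI CLIP package and PyTorch; the `depth_map_from_points` projection and the hand-written prompts are simplified placeholders, not the paper's realistic shape projection module or GPT-generated 3D-specific text.

```python
# Minimal sketch of zero-shot 3D classification with CLIP.
# Assumes: pip install torch git+https://github.com/openai/CLIP.git
import torch
import clip


def depth_map_from_points(points, view_rotation, resolution=224):
    """Splat an (N, 3) point cloud, normalized to [-1, 1], into a crude
    single-view depth map. Placeholder for V2's realistic shape projection."""
    pts = points @ view_rotation.T                       # rotate into the view
    xy = ((pts[:, :2] + 1) / 2 * (resolution - 1)).long().clamp(0, resolution - 1)
    depth = torch.zeros(resolution, resolution)
    depth[xy[:, 1], xy[:, 0]] = (pts[:, 2] + 1) / 2      # normalized z as intensity
    return depth.unsqueeze(0).repeat(3, 1, 1)            # fake 3-channel image


@torch.no_grad()
def zero_shot_classify(points, class_prompts, views):
    """Return a (1, num_classes) tensor of class probabilities."""
    model, _ = clip.load("ViT-B/16", device="cpu")

    # Textual end: in PointCLIP V2 these prompts would come from GPT;
    # here they are ordinary hand-written strings.
    text_feat = model.encode_text(clip.tokenize(class_prompts))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Visual end: encode each projected depth map and average over views.
    maps = torch.stack([depth_map_from_points(points, R) for R in views])
    img_feat = model.encode_image(maps).mean(dim=0, keepdim=True)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    return (100.0 * img_feat @ text_feat.T).softmax(dim=-1)


# Hypothetical usage: a normalized point cloud, three prompts, one view.
# probs = zero_shot_classify(points,
#                            ["a depth map of a chair",
#                             "a depth map of a table",
#                             "a depth map of a lamp"],
#                            views=[torch.eye(3)])
```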
Related papers
- Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic
Segmentation [17.914290294935427]
Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set.
Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in the zero-shot 2D vision tasks.
We propose a simple yet effective baseline to transfer the visual-linguistic knowledge embedded in CLIP to a point cloud encoder.
arXiv Detail & Related papers (2023-12-12T12:35:59Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with only a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
- PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models [56.324516906160234]
Generalizable 3D part segmentation is important but challenging in vision and robotics.
This paper explores an alternative way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP.
We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud rendering and a novel 2D-to-3D label lifting algorithm.
arXiv Detail & Related papers (2022-12-03T06:59:01Z)
- CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training [121.46758260964114]
Pre-training across 3D vision and language remains under development because of limited training data.
Recent works attempt to transfer vision-language pre-training models to 3D vision.
PointCLIP converts point cloud data to multi-view depth maps, adopting CLIP for shape classification.
We propose CLIP2Point, an image-depth pre-training method based on contrastive learning that transfers CLIP to the 3D domain.
arXiv Detail & Related papers (2022-10-03T16:13:14Z)
- PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP, at low resource cost and in a low-data regime.
arXiv Detail & Related papers (2021-12-04T19:42:40Z)