PointCLIP: Point Cloud Understanding by CLIP
- URL: http://arxiv.org/abs/2112.02413v1
- Date: Sat, 4 Dec 2021 19:42:40 GMT
- Title: PointCLIP: Point Cloud Understanding by CLIP
- Authors: Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui,
Yu Qiao, Peng Gao, Hongsheng Li
- Abstract summary: We propose PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP at low resource cost and in low-data regimes.
- Score: 77.02399444893963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, zero-shot and few-shot learning via Contrastive Vision-Language
Pre-training (CLIP) have shown inspirational performance on 2D visual
recognition, which learns to match images with their corresponding texts in
open-vocabulary settings. However, it remains underexplored whether CLIP,
pre-trained on large-scale 2D image-text pairs, can be generalized to 3D
recognition. In this paper, we show that such a setting is feasible by proposing
PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts.
Specifically, we encode a point cloud by projecting it into multi-view depth
maps without rendering, and aggregate the view-wise zero-shot predictions to
achieve knowledge transfer from 2D to 3D. On top of that, we design an
inter-view adapter to better extract the global feature and adaptively fuse
few-shot knowledge learned in 3D into CLIP pre-trained in 2D. By fine-tuning
only this lightweight adapter in few-shot settings, the performance of
PointCLIP is largely improved. In addition, we observe a complementary property
between PointCLIP and classical 3D-supervised networks. By simple ensembling,
PointCLIP boosts the baseline's performance and even surpasses state-of-the-art
models. PointCLIP is therefore a promising alternative for effective 3D point
cloud understanding via CLIP at low resource cost and in low-data regimes. We
conduct thorough experiments on
widely-adopted ModelNet10, ModelNet40 and the challenging ScanObjectNN to
demonstrate the effectiveness of PointCLIP. The code is released at
https://github.com/ZrrSkywalker/PointCLIP.
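A minimal sketch of the zero-shot pipeline described in the abstract: rendering-free projection of a point cloud into multi-view depth maps, CLIP encoding of each view, and aggregation of the view-wise predictions. The three axis-aligned views, the prompt template, the naive handling of overlapping points, and the uniform view averaging are illustrative assumptions and do not reproduce the released implementation.
```python
# Minimal sketch (illustrative assumptions, not the official PointCLIP code):
# orthographic depth-map projection + per-view CLIP zero-shot prediction +
# uniform averaging over views.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def project_depth_map(points, axis, size=224):
    """Project an (N, 3) point cloud onto the plane orthogonal to `axis`
    without rendering; overlapping points are resolved naively here."""
    pts = points - points.mean(dim=0)
    pts = pts / pts.abs().max()                          # normalize to [-1, 1]
    dims = [d for d in range(3) if d != axis]
    uv = ((pts[:, dims] + 1) / 2 * (size - 1)).long()    # pixel coordinates
    depth = (pts[:, axis] + 1) / 2                       # depth in [0, 1]
    img = torch.zeros(size, size)
    img[uv[:, 1], uv[:, 0]] = 1.0 - depth                # closer points brighter
    return img.unsqueeze(0).repeat(3, 1, 1)              # replicate to 3 channels

@torch.no_grad()
def zero_shot_classify(points, class_names):
    views = torch.stack([project_depth_map(points, axis=a) for a in range(3)])
    prompts = clip.tokenize([f"point cloud depth map of a {c}" for c in class_names])
    img_feat = F.normalize(model.encode_image(views.to(device)).float(), dim=-1)
    txt_feat = F.normalize(model.encode_text(prompts.to(device)).float(), dim=-1)
    logits = 100.0 * img_feat @ txt_feat.t()             # (num_views, num_classes)
    return logits.mean(dim=0).softmax(dim=-1)            # aggregate over views

probs = zero_shot_classify(torch.randn(1024, 3), ["airplane", "chair", "table"])
print(probs)
```
The few-shot inter-view adapter mentioned in the abstract would replace the uniform `mean` over views with a small learnable module that fuses the per-view features before matching them against the text embeddings.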
Related papers
- CLIP-based Point Cloud Classification via Point Cloud to Image Translation [19.836264118079573]
The CLIP-based point cloud classification model PointCLIP has opened a new direction in point cloud classification research.
We propose a Pretrained Point Cloud to Image Translation Network (PPCITNet) that produces generalized colored images with additional salient visual cues for the point cloud depth maps.
arXiv Detail & Related papers (2024-08-07T04:50:05Z)
- GS-CLIP: Gaussian Splatting for Contrastive Language-Image-3D Pretraining from Real-World Data [73.06536202251915]
3D shapes represented as point clouds have seen advancements in multimodal pre-training that aligns them with image and language descriptions.
We propose GS-CLIP as the first attempt to introduce 3D Gaussian Splatting (3DGS) into multimodal pre-training to enhance the 3D representation.
arXiv Detail & Related papers (2024-02-09T05:46:47Z)
- Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation [17.914290294935427]
Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set.
Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in zero-shot 2D vision tasks.
We propose a simple yet effective baseline to transfer the visual-linguistic knowledge implied in CLIP to a point cloud encoder.
arXiv Detail & Related papers (2023-12-12T12:35:59Z)
- CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [55.864132158596206]
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
arXiv Detail & Related papers (2023-01-12T10:42:39Z)
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with little annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
- EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder [60.52613206271329]
This paper introduces Efficient Point Cloud Learning (EPCL) for training high-quality point cloud models with a frozen CLIP transformer.
EPCL connects the 2D and 3D modalities by semantically aligning image features and point cloud features without paired 2D-3D data.
arXiv Detail & Related papers (2022-12-08T06:27:11Z)
- PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning [40.28152121477885]
We combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection.
Our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification.
arXiv Detail & Related papers (2022-11-21T17:52:43Z)
- CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training [121.46758260964114]
Pre-training across 3D vision and language remains under development because of limited training data.
Recent works attempt to transfer vision-language pre-training models to 3D vision.
PointCLIP converts point cloud data to multi-view depth maps and adopts CLIP for shape classification.
We propose CLIP2Point, an image-depth pre-training method that uses contrastive learning to transfer CLIP to the 3D domain; a generic sketch of such contrastive alignment follows this list.
arXiv Detail & Related papers (2022-10-03T16:13:14Z)
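As a rough, generic illustration of the image-depth contrastive pre-training idea mentioned in the CLIP2Point entry above, the sketch below computes a symmetric InfoNCE loss between paired image and depth-map embeddings. The embedding dimension, batch size, and temperature are placeholder assumptions, and the encoders producing these embeddings are not shown; this is not the paper's implementation.
```python
# Generic symmetric InfoNCE between paired image/depth embeddings
# (placeholder values, not CLIP2Point's actual training setup).
import torch
import torch.nn.functional as F

def image_depth_contrastive_loss(img_emb, depth_emb, temperature=0.07):
    """Pull matching image/depth pairs together, push mismatched pairs apart."""
    img = F.normalize(img_emb, dim=-1)
    dep = F.normalize(depth_emb, dim=-1)
    logits = img @ dep.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(img.size(0))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random features standing in for encoder outputs.
loss = image_depth_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```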