CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth
Pre-training
- URL: http://arxiv.org/abs/2210.01055v3
- Date: Wed, 23 Aug 2023 03:24:13 GMT
- Title: CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth
Pre-training
- Authors: Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson W.H.
Lau, Wanli Ouyang, Wangmeng Zuo
- Abstract summary: Pre-training across 3D vision and language remains under development because of limited training data.
Recent works attempt to transfer vision-language pre-training models to 3D vision.
PointCLIP converts point cloud data to multi-view depth maps, adopting CLIP for shape classification.
We propose CLIP2Point, an image-depth pre-training method by contrastive learning to transfer CLIP to the 3D domain.
- Score: 121.46758260964114
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-training across 3D vision and language remains under development because
of limited training data. Recent works attempt to transfer vision-language
pre-training models to 3D vision. PointCLIP converts point cloud data to
multi-view depth maps, adopting CLIP for shape classification. However, its
performance is restricted by the domain gap between rendered depth maps and
images, as well as the diversity of depth distributions. To address this issue,
we propose CLIP2Point, an image-depth pre-training method by contrastive
learning to transfer CLIP to the 3D domain, and adapt it to point cloud
classification. We introduce a new depth rendering setting that forms a better
visual effect, and then render 52,460 pairs of images and depth maps from
ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines
cross-modality learning to enforce the depth features for capturing expressive
visual and textual features and intra-modality learning to enhance the
invariance of depth aggregation. Additionally, we propose a novel Dual-Path
Adapter (DPA) module, i.e., a dual-path structure with simplified adapters for
few-shot learning. The dual-path structure allows the joint use of CLIP and
CLIP2Point, and the simplified adapter can well fit few-shot tasks without
post-search. Experimental results show that CLIP2Point is effective in
transferring CLIP knowledge to 3D vision. Our CLIP2Point outperforms PointCLIP
and other self-supervised 3D networks, achieving state-of-the-art results on
zero-shot and few-shot classification.
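The abstract states the pre-training objective only in words. Below is a minimal PyTorch-style sketch of how a cross-modality (image-depth) plus intra-modality (depth-depth) contrastive step could be written; the encoder callables (clip_image_encoder, depth_encoder), the two-view mean aggregation, and the equal loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch, not the authors' released code: one pre-training step that
# combines a cross-modality (image <-> depth) and an intra-modality
# (depth <-> depth) contrastive term, as the abstract describes in words.
# Encoder callables, the two-view aggregation, and the equal loss weighting
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired features."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def pretraining_step(images, depth_view_a, depth_view_b,
                     clip_image_encoder, depth_encoder):
    """images: rendered RGB views; depth_view_a/b: two depth renderings of the
    same object. clip_image_encoder is frozen, depth_encoder is trainable."""
    with torch.no_grad():
        img_feat = clip_image_encoder(images)             # (B, D), frozen CLIP branch
    feat_a = depth_encoder(depth_view_a)                  # (B, D)
    feat_b = depth_encoder(depth_view_b)                  # (B, D)

    depth_feat = 0.5 * (feat_a + feat_b)                  # aggregate the two depth views
    loss_cross = info_nce(img_feat, depth_feat)           # pull depth toward CLIP's image space
    loss_intra = info_nce(feat_a, feat_b)                 # make aggregation view-invariant
    return loss_cross + loss_intra
```

Under this reading, the cross-modality term aligns depth features with the frozen CLIP image space (and hence with CLIP's paired text space), while the intra-modality term encourages the aggregated depth representation to be invariant to the particular rendered views.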
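The Dual-Path Adapter is likewise described only at a high level. The following sketch shows one plausible reading, assuming a common residual-bottleneck adapter and a fixed fusion weight between the frozen CLIP path and the frozen image-depth pre-trained path; the layer sizes, residual ratio, and fusion weight are assumptions, not the paper's configuration.

```python
# A hypothetical Dual-Path Adapter head: two lightweight adapters over frozen
# backbones (CLIP visual features and CLIP2Point-style depth features), with
# their class logits fused against fixed text embeddings. All hyperparameters
# here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedAdapter(nn.Module):
    """Bottleneck MLP blended with a residual connection (a common adapter pattern)."""
    def __init__(self, dim=512, hidden=256, residual_ratio=0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, dim))
        self.residual_ratio = residual_ratio               # fixed, i.e. no post-search

    def forward(self, x):
        return self.residual_ratio * x + (1.0 - self.residual_ratio) * self.mlp(x)

class DualPathHead(nn.Module):
    """Fuses few-shot class scores from the CLIP path and the depth path."""
    def __init__(self, text_feat, dim=512, fusion=0.5):
        super().__init__()
        self.register_buffer("text_feat", F.normalize(text_feat, dim=-1))  # (C, D)
        self.adapter_clip = SimplifiedAdapter(dim)
        self.adapter_depth = SimplifiedAdapter(dim)
        self.fusion = fusion

    def forward(self, clip_feat, depth_feat):
        a = F.normalize(self.adapter_clip(clip_feat), dim=-1)     # (B, D)
        b = F.normalize(self.adapter_depth(depth_feat), dim=-1)   # (B, D)
        logits_clip = 100.0 * a @ self.text_feat.t()              # (B, C)
        logits_depth = 100.0 * b @ self.text_feat.t()
        return self.fusion * logits_clip + (1.0 - self.fusion) * logits_depth
```

Only the two adapters would be trained on the few-shot data in this sketch; the fixed residual ratio reflects the abstract's claim that the simplified adapter fits few-shot tasks without post-search.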
Related papers
- CLIP-based Point Cloud Classification via Point Cloud to Image Translation [19.836264118079573]
The Contrastive Vision-Language Pre-training (CLIP) based point cloud classification model, PointCLIP, has opened a new direction in point cloud classification research.
We propose a Pretrained Point Cloud to Image Translation Network (PPCITNet) that produces generalized colored images with additional salient visual cues to complement the point cloud depth maps.
arXiv Detail & Related papers (2024-08-07T04:50:05Z) - Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic
Segmentation [17.914290294935427]
Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set.
Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in zero-shot 2D vision tasks.
We propose a simple yet effective baseline to transfer the visual-linguistic knowledge implied in CLIP to a point cloud encoder.
arXiv Detail & Related papers (2023-12-12T12:35:59Z) - CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z) - CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [55.864132158596206]
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
arXiv Detail & Related papers (2023-01-12T10:42:39Z) - ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z) - PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained
Image-Language Models [56.324516906160234]
Generalizable 3D part segmentation is important but challenging in vision and robotics.
This paper explores an alternative way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP.
We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud rendering and a novel 2D-to-3D label lifting algorithm.
arXiv Detail & Related papers (2022-12-03T06:59:01Z) - PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning [40.28152121477885]
We first combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2.
PointCLIP V2 fully unleashes their potential for zero-shot 3D classification, segmentation, and detection.
Our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification.
arXiv Detail & Related papers (2022-11-21T17:52:43Z) - PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP under low resource cost and data regimes.
arXiv Detail & Related papers (2021-12-04T19:42:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.