Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic
Segmentation
- URL: http://arxiv.org/abs/2312.07221v1
- Date: Tue, 12 Dec 2023 12:35:59 GMT
- Title: Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic
Segmentation
- Authors: Yuanbin Wang, Shaofei Huang, Yulu Gao, Zhen Wang, Rui Wang, Kehua
Sheng, Bo Zhang, Si Liu
- Abstract summary: Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set.
Large-scale visual-language pre-trained models, such as CLIP, have shown strong generalization ability in zero-shot 2D vision tasks.
We propose a simple yet effective baseline that transfers the visual-linguistic knowledge embedded in CLIP to the point cloud encoder.
- Score: 17.914290294935427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional 3D segmentation methods can only recognize a fixed range of
classes that appear in the training set, which limits their application in
real-world scenarios due to the lack of generalization ability. Large-scale
visual-language pre-trained models, such as CLIP, have shown strong
generalization ability in zero-shot 2D vision tasks, but they cannot be applied
directly to 3D semantic segmentation. In this work, we focus on zero-shot point
cloud semantic segmentation and propose a simple yet effective baseline that
transfers the visual-linguistic knowledge embedded in CLIP to the point cloud
encoder at both the feature and output levels. Feature-level and output-level
alignments are conducted between the 2D and 3D encoders for effective knowledge
transfer. Concretely, for feature-level alignment, a Multi-granularity
Cross-modal Feature Alignment (MCFA) module aligns 2D and 3D features from both
global semantic and local position perspectives. At the output level, per-pixel
pseudo labels for unseen classes are extracted with the pre-trained CLIP model
and used as supervision, so that the 3D segmentation model mimics the behavior
of the CLIP image encoder. Extensive experiments are conducted on two popular
point cloud segmentation benchmarks. Our method significantly outperforms
previous state-of-the-art methods under the zero-shot setting (+29.2% mIoU on
SemanticKITTI and 31.8% mIoU on nuScenes), and further achieves promising
results in the annotation-free point cloud semantic segmentation setting,
showing its great potential for label-efficient learning.
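The two transfer paths above, feature-level alignment and pseudo-label supervision at the output level, can be summarized in a short sketch. The PyTorch code below is a simplified illustration rather than the paper's released implementation: the tensor shapes, the point-to-pixel pairing, the ignore index, and the weighting term lambda_out are assumptions, and the full multi-granularity design of MCFA is not reproduced.

```python
# Minimal sketch of the two alignment losses described in the abstract.
# All names, shapes, and weights here are illustrative assumptions, not the
# authors' released code.
import torch
import torch.nn.functional as F


def feature_alignment_loss(feat_3d, feat_2d):
    """Align 3D point features with CLIP 2D features at the pixels the points project to.

    feat_3d: (N, C) features from the 3D point cloud encoder.
    feat_2d: (N, C) CLIP image features sampled at the corresponding pixels.
    """
    feat_3d = F.normalize(feat_3d, dim=-1)
    feat_2d = F.normalize(feat_2d, dim=-1)
    # Cosine-distance alignment between paired 2D/3D features.
    return (1.0 - (feat_3d * feat_2d).sum(dim=-1)).mean()


def output_alignment_loss(logits_3d, pseudo_labels):
    """Supervise the 3D segmentation head with per-point pseudo labels from CLIP.

    logits_3d: (N, K) class logits predicted by the 3D model.
    pseudo_labels: (N,) labels transferred from CLIP's per-pixel predictions;
                   points without a label can be marked with -100.
    """
    return F.cross_entropy(logits_3d, pseudo_labels, ignore_index=-100)


def total_loss(feat_3d, feat_2d, logits_3d, pseudo_labels, lambda_out=1.0):
    # Combine both alignment objectives; lambda_out is an assumed weighting term.
    return (feature_alignment_loss(feat_3d, feat_2d)
            + lambda_out * output_alignment_loss(logits_3d, pseudo_labels))
```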
Related papers
- Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels [69.55622471172941]
Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models.
We propose Cross-MoST (Cross-Modal Self-Training), an optimization framework that improves the label-free classification performance of a zero-shot 3D vision model.
arXiv Detail & Related papers (2024-04-15T21:30:50Z) - CLIPose: Category-Level Object Pose Estimation with Pre-trained
Vision-Language Knowledge [18.57081150228812]
We propose a novel 6D pose framework that employs a pre-trained vision-language model to better learn object category information.
CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real time during inference (40 FPS).
arXiv Detail & Related papers (2024-02-24T05:31:53Z) - Leveraging Large-Scale Pretrained Vision Foundation Models for
Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z) - CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [55.864132158596206]
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
arXiv Detail & Related papers (2023-01-12T10:42:39Z) - PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning [40.28152121477885]
We first combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2.
PointCLIP V2 fully unleashes their potential for zero-shot 3D classification, segmentation, and detection.
Our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification.
arXiv Detail & Related papers (2022-11-21T17:52:43Z) - CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth
Pre-training [121.46758260964114]
Pre-training across 3D vision and language remains under development because of limited training data.
Recent works attempt to transfer vision-language pre-training models to 3D vision.
PointCLIP converts point cloud data to multi-view depth maps, adopting CLIP for shape classification.
We propose CLIP2Point, an image-depth pre-training method that uses contrastive learning to transfer CLIP to the 3D domain.
arXiv Detail & Related papers (2022-10-03T16:13:14Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP at low resource cost and in low-data regimes (a minimal sketch of this zero-shot recipe follows this list).
arXiv Detail & Related papers (2021-12-04T19:42:40Z) - Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point
Clouds of Wild Scenes [36.07733308424772]
The deficiency of 3D segmentation labels is one of the main obstacles to effective point cloud segmentation.
We propose a novel deep graph convolutional network-based framework for large-scale semantic scene segmentation in point clouds using only 2D supervision.
arXiv Detail & Related papers (2020-04-26T23:02:23Z)