CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
- URL: http://arxiv.org/abs/2301.04926v2
- Date: Thu, 6 Apr 2023 09:56:10 GMT
- Title: CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
- Authors: Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang
Li, Yuenan Hou, Yu Qiao, Wenping Wang
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
- Score: 55.864132158596206
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) achieves promising results in
2D zero-shot and few-shot learning. Despite this impressive 2D performance,
applying CLIP to aid learning in 3D scene understanding has yet to be
explored. In this paper, we make the first attempt to investigate how CLIP
knowledge benefits 3D scene understanding. We propose CLIP2Scene, a simple yet
effective framework that transfers CLIP knowledge from 2D image-text
pre-trained models to a 3D point cloud network. We show that the pre-trained 3D
network yields impressive performance on various downstream tasks, i.e.,
annotation-free semantic segmentation and fine-tuning with labelled data.
Specifically, built upon CLIP, we design a Semantic-driven Cross-modal
Contrastive Learning framework that pre-trains a 3D network via semantic and
spatial-temporal consistency regularization. For the former, we first leverage
CLIP's text semantics to select the positive and negative point samples and
then employ the contrastive loss to train the 3D network. In terms of the
latter, we enforce consistency between temporally coherent point cloud
features and their corresponding image features. We conduct experiments on
SemanticKITTI, nuScenes, and ScanNet. For the first time, our pre-trained
network achieves annotation-free 3D semantic segmentation with 20.8% and 25.08%
mIoU on nuScenes and ScanNet, respectively. When fine-tuned with 1% or 100%
labelled data, our method significantly outperforms other self-supervised
methods, with improvements of 8% and 1% mIoU, respectively. Furthermore, we
demonstrate its generalizability in handling cross-domain datasets. Code is
publicly available at https://github.com/runnanchen/CLIP2Scene.
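
To make the two pre-training signals concrete, the sketch below gives a minimal, hedged PyTorch illustration of a semantic-driven cross-modal contrastive objective of this kind. It is not the released implementation: the function name, the prompt-derived `text_feats`, and the assumption that point-pixel pairs come from calibrated LiDAR-camera data are all illustrative, and standard cross-entropy and cosine terms stand in for the paper's exact losses.

```python
# Minimal sketch of the two training signals described in the abstract.
# NOT the authors' released code: the point-to-pixel pairing, feature
# dimensions, and loss forms are illustrative assumptions.
import torch
import torch.nn.functional as F


def semantic_cross_modal_losses(point_feats, pixel_feats, text_feats, tau=0.07):
    """
    point_feats : (N, D) features from the 3D network for N points that project
                  onto the paired camera images (calibration assumed given).
    pixel_feats : (N, D) CLIP image features sampled at the matching pixels.
    text_feats  : (C, D) CLIP text embeddings of C class-name prompts.
    """
    point_feats = F.normalize(point_feats, dim=-1)
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Semantic term: each pixel votes for a pseudo-class via image-text
    # similarity; the text embedding of that class acts as the positive for
    # the paired point, and the other class embeddings act as negatives.
    pseudo_labels = (pixel_feats @ text_feats.t()).argmax(dim=-1)   # (N,)
    logits = point_feats @ text_feats.t() / tau                     # (N, C)
    loss_semantic = F.cross_entropy(logits, pseudo_labels)

    # Spatio-temporal consistency term: keep each point feature close to the
    # image feature it was paired with across the sweep (a simple cosine
    # objective stands in for the paper's exact regularization).
    loss_consistency = (1.0 - (point_feats * pixel_feats).sum(dim=-1)).mean()

    return loss_semantic, loss_consistency
```

At inference time, annotation-free segmentation follows the same pattern: each point feature is classified by its highest similarity to the class text embeddings, which is how a pre-trained 3D network can predict labels without any 3D annotations.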
Related papers
- Bayesian Self-Training for Semi-Supervised 3D Segmentation [59.544558398992386]
3D segmentation is a core problem in computer vision.
Densely labeling 3D point clouds for fully-supervised training remains too labor-intensive and expensive.
Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set.
arXiv Detail & Related papers (2024-09-12T14:54:31Z)
- Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration [107.61458720202984]
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes.
We propose a learnable transformation alignment to bridge the domain gap between image and point cloud data.
We establish dense 2D-3D correspondences to estimate the rigid pose.
arXiv Detail & Related papers (2024-01-23T02:41:06Z)
- Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation [17.914290294935427]
Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set.
Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in the zero-shot 2D vision tasks.
We propose a simple yet effective baseline to transfer the visual-linguistic knowledge implied in CLIP to a point cloud encoder.
arXiv Detail & Related papers (2023-12-12T12:35:59Z)
- Towards Label-free Scene Understanding by Vision Foundation Models [87.13117617056004]
We investigate the potential of vision foundation models in enabling networks to comprehend 2D and 3D worlds without labelled data.
We propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously.
Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improving by 4.7% and 7.9%, respectively.
arXiv Detail & Related papers (2023-06-06T17:57:49Z)
- CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
Vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision.
arXiv Detail & Related papers (2023-03-08T17:30:58Z)
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
- PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP under low resource cost and data regimes.
arXiv Detail & Related papers (2021-12-04T19:42:40Z)