EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder
- URL: http://arxiv.org/abs/2212.04098v3
- Date: Sun, 10 Dec 2023 16:47:58 GMT
- Title: EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder
- Authors: Xiaoshui Huang, Zhou Huang, Sheng Li, Wentao Qu, Tong He, Yuenan Hou,
Yifan Zuo, Wanli Ouyang
- Abstract summary: This paper introduces Efficient Point Cloud Learning (EPCL) for training high-quality point cloud models with a frozen CLIP transformer.
Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data.
- Score: 60.52613206271329
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pretrain-finetune paradigm has achieved great success in NLP and 2D image
fields because of the high-quality representation ability and transferability
of their pretrained models. However, pretraining such a strong model is
difficult in the 3D point cloud field due to the limited number of point cloud
sequences. This paper introduces \textbf{E}fficient \textbf{P}oint
\textbf{C}loud \textbf{L}earning (EPCL), an effective and efficient point cloud
learner for directly training high-quality point cloud models with a frozen
CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically
aligning the image features and point cloud features without paired 2D-3D data.
Specifically, the input point cloud is divided into a series of local patches,
which are converted to token embeddings by the designed point cloud tokenizer.
These token embeddings are concatenated with a task token and fed into the
frozen CLIP transformer to learn point cloud representation. The intuition is
that the proposed point cloud tokenizer projects the input point cloud into a
unified token space similar to that of 2D images. Comprehensive experiments
on 3D detection, semantic segmentation, classification and few-shot learning
demonstrate that the CLIP transformer can serve as an efficient point cloud
encoder and our method achieves promising performance on both indoor and
outdoor benchmarks. In particular, performance gains brought by our EPCL are
$\textbf{19.7}$ AP$_{50}$ on ScanNet V2 detection, $\textbf{4.4}$ mIoU on S3DIS
segmentation and $\textbf{1.2}$ mIoU on SemanticKITTI segmentation compared to
contemporary pretrained models. Code is available at
\url{https://github.com/XiaoshuiHuang/EPCL}.
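To make the pipeline concrete, here is a minimal PyTorch sketch of the flow the abstract describes: a tokenizer turns local patches into token embeddings, a learnable task token is prepended, and the frozen CLIP transformer encodes the sequence. The `PointTokenizer` internals, patch count, and embedding width are illustrative assumptions, not the authors' implementation (that is in the linked repository).

```python
import torch
import torch.nn as nn

class PointTokenizer(nn.Module):
    """Hypothetical stand-in for EPCL's point cloud tokenizer: groups the
    cloud into local patches and embeds each patch into the token space
    the CLIP transformer expects."""
    def __init__(self, num_patches=64, dim=768):
        super().__init__()
        self.num_patches = num_patches
        # Per-point MLP followed by max-pooling over each patch (PointNet-style).
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.GELU(), nn.Linear(128, dim))

    def forward(self, xyz):                            # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        # Naive patching: contiguous groups (real tokenizers use FPS + kNN);
        # assumes N is divisible by num_patches.
        patches = xyz.view(B, self.num_patches, N // self.num_patches, 3)
        return self.mlp(patches).max(dim=2).values     # (B, num_patches, dim)

class EPCLSketch(nn.Module):
    def __init__(self, clip_transformer, dim=768):
        super().__init__()
        self.tokenizer = PointTokenizer(dim=dim)
        self.task_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable task token
        self.clip = clip_transformer
        for p in self.clip.parameters():               # keep the CLIP transformer frozen
            p.requires_grad = False

    def forward(self, xyz):
        tokens = self.tokenizer(xyz)                   # (B, P, dim)
        tokens = torch.cat([self.task_token.expand(xyz.shape[0], -1, -1), tokens], dim=1)
        return self.clip(tokens)                       # frozen CLIP blocks do the encoding

feats = EPCLSketch(nn.Identity())(torch.randn(2, 2048, 3))  # nn.Identity() as a placeholder
```

Only the tokenizer, the task token, and any task head are trained; the CLIP weights stay fixed throughout.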
Related papers
- P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising [81.92854168911704]
We tackle the task of point cloud denoising through a novel framework that adapts Diffusion Schrödinger bridges to point clouds.
Experiments on object datasets show that P2P-Bridge achieves significant improvements over existing methods.
arXiv Detail & Related papers (2024-08-29T08:00:07Z)
- Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments.
We propose a novel generic representation called Structured Point Cloud Videos (SPCVs).
An SPCV re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points.
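The layout itself is easy to picture. A toy NumPy sketch, assuming points are already in pixel order (the real method learns a spatially smooth, temporally consistent point-to-pixel assignment, which this omits):

```python
import numpy as np

# A sequence of T point clouds with H*W points each becomes a (T, H, W, 3)
# "video" whose pixel values are XYZ coordinates; here we just reshape
# in input order rather than learning the assignment.
T, H, W = 8, 32, 32
sequence = np.random.rand(T, H * W, 3)   # T frames of H*W points
spcv = sequence.reshape(T, H, W, 3)      # 2D "video" of 3D coordinates

# Any 2D video operator (e.g., a convolution) can now slide over H, W, and T.
print(spcv.shape)                        # (8, 32, 32, 3)
```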
arXiv Detail & Related papers (2024-03-02T08:18:57Z)
- PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds [18.840000859663153]
We propose PRED, a novel image-assisted pre-training framework for outdoor point clouds.
The main ingredient of our framework is semantic rendering conditioned on a Bird's-Eye-View (BEV) feature map.
We further enhance our model's performance by incorporating point-wise masking with a high mask ratio.
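As a rough, hypothetical illustration of the BEV projection such a framework starts from (grid size, range, and the occupancy/height features are assumptions; PRED's learned features and rendering head are omitted):

```python
import numpy as np

def points_to_bev(points, x_range=(-50, 50), y_range=(-50, 50), grid=200):
    """Rasterize LiDAR points (N, 3) into a simple BEV map with occupancy
    and max-height channels (height clamped at 0 for brevity)."""
    xs = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * grid).astype(int)
    ys = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * grid).astype(int)
    keep = (xs >= 0) & (xs < grid) & (ys >= 0) & (ys < grid)
    bev = np.zeros((grid, grid, 2), dtype=np.float32)   # [occupancy, max z]
    for x, y, z in zip(xs[keep], ys[keep], points[keep, 2]):
        bev[y, x, 0] = 1.0
        bev[y, x, 1] = max(bev[y, x, 1], z)
    return bev

bev = points_to_bev(np.random.randn(1000, 3) * 20)       # (200, 200, 2)
```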
arXiv Detail & Related papers (2023-11-08T07:26:09Z)
- 2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision [36.282611420496416]
We propose a transformer model with two encoders and one decoder for weakly supervised point cloud segmentation.
The decoder implements 2D-3D cross-attention and carries out implicit 2D and 3D feature fusion.
Experiments show that it performs favorably against existing weakly supervised point cloud segmentation methods.
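A minimal sketch of 2D-3D cross-attention, assuming point tokens query image tokens (the dimensions and single-layer design are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

# 3D point tokens attend to 2D image tokens, implicitly fusing the modalities.
dim = 256
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

tokens_3d = torch.randn(2, 1024, dim)   # queries: point tokens from the 3D encoder
tokens_2d = torch.randn(2, 196, dim)    # keys/values: patch tokens from the 2D encoder

fused, _ = cross_attn(query=tokens_3d, key=tokens_2d, value=tokens_2d)
print(fused.shape)                      # (2, 1024, 256): 2D-informed 3D features
```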
arXiv Detail & Related papers (2023-10-19T15:12:44Z)
- Point2Vec for Self-Supervised Representation Learning on Point Clouds [66.53955515020053]
We extend data2vec to the point cloud domain and report encouraging results on several downstream tasks.
We propose point2vec, which unleashes the full potential of data2vec-like pre-training on point clouds.
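In miniature, data2vec-style pre-training works as in the sketch below: a student encodes a masked input and regresses latent targets produced by an EMA teacher that saw the full input. The linear encoder stand-in, mask ratio, and decay value are assumptions:

```python
import copy
import torch

def ema_update(teacher, student, decay=0.999):
    """Exponential moving average of student weights into the teacher."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)

student = torch.nn.Linear(3, 64)              # stand-in for the point encoder
teacher = copy.deepcopy(student)
points = torch.randn(16, 3)
mask = torch.rand(16) < 0.6                   # mask a portion of the input

target = teacher(points).detach()             # teacher sees the full cloud
masked = points.clone()
masked[mask] = 0.0                            # student sees the masked cloud
loss = ((student(masked)[mask] - target[mask]) ** 2).mean()
loss.backward()
ema_update(teacher, student)                  # teacher slowly tracks the student
```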
arXiv Detail & Related papers (2023-03-29T10:08:29Z)
- Masked Autoencoders in 3D Point Cloud Representation Learning [7.617783375837524]
We propose Masked Autoencoders in 3D point cloud representation learning (abbreviated as MAE3D).
We first split the input point cloud into patches and mask a portion of them, then use our Patch Embedding Module to extract the features of unmasked patches.
Comprehensive experiments demonstrate that the local features extracted by our MAE3D from point cloud patches are beneficial for downstream classification tasks.
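A hedged sketch of that input preparation (patch count, mask ratio, and the embedding layer are assumptions, not the paper's Patch Embedding Module):

```python
import torch

B, P, K = 4, 64, 32                          # batch, patches, points per patch
patches = torch.randn(B, P, K, 3)            # pre-grouped local patches
mask_ratio = 0.6
num_masked = int(P * mask_ratio)

perm = torch.rand(B, P).argsort(dim=1)       # random patch order per sample
visible_idx = perm[:, num_masked:]           # keep the last 40% as visible

gather = visible_idx[..., None, None].expand(-1, -1, K, 3)
visible = patches.gather(1, gather)          # (B, P - num_masked, K, 3)

embed = torch.nn.Sequential(torch.nn.Flatten(2), torch.nn.Linear(K * 3, 256))
tokens = embed(visible)                      # features of unmasked patches only
```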
arXiv Detail & Related papers (2022-07-04T16:13:27Z)
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training [56.81809311892475]
Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers.
We propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds.
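Hierarchies like this are typically built by repeated farthest point sampling; a small sketch (the scale ratios are assumptions, and Point-M2AE's grouping and skip connections are omitted):

```python
import torch

def farthest_point_sample(xyz, m):
    """Greedy FPS: pick m well-spread points from xyz (N, 3),
    starting from an arbitrary first point."""
    N = xyz.shape[0]
    sel = torch.zeros(m, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    for i in range(1, m):
        dist = torch.minimum(dist, ((xyz - xyz[sel[i - 1]]) ** 2).sum(-1))
        sel[i] = dist.argmax()               # farthest point from all selected so far
    return xyz[sel]

# Hypothetical 3-level hierarchy, coarser at each scale.
cloud = torch.randn(2048, 3)
scales = [cloud]
for m in (512, 128, 32):
    scales.append(farthest_point_sample(scales[-1], m))
```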
arXiv Detail & Related papers (2022-05-28T11:22:53Z)
- PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP, at low resource cost and in low-data regimes.
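A zero-shot sketch in this spirit, using OpenAI's `clip` package: encode rendered views of the cloud with CLIP's image encoder and match them against text embeddings of the category names. The prompt template and the pre-rendered `depth_views` placeholder are assumptions:

```python
import torch
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = ["chair", "table", "airplane"]
prompts = clip.tokenize([f"point cloud depth map of a {c}" for c in classes]).to(device)
depth_views = torch.zeros(6, 3, 224, 224)    # placeholder: 6 pre-rendered depth-map views

with torch.no_grad():
    text = model.encode_text(prompts)                     # (3, D) class embeddings
    views = model.encode_image(depth_views.to(device))    # (6, D) view embeddings
    image = views.mean(dim=0, keepdim=True)               # average the views
    image = image / image.norm(dim=-1, keepdim=True)
    text = text / text.norm(dim=-1, keepdim=True)
    logits = image @ text.T                               # cosine similarities
print(classes[logits.argmax().item()])
```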
arXiv Detail & Related papers (2021-12-04T19:42:40Z)
- Point Cloud Pre-training by Mixing and Disentangling [35.18101910728478]
Mixing and Disentangling (MD) is a self-supervised learning approach for point cloud pre-training.
We show that an encoder pre-trained with MD significantly surpasses the same encoder trained from scratch and converges quickly.
We hope this self-supervised learning attempt on point clouds can pave the way for reducing the deeply-learned model dependence on large-scale labeled data.
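The "mixing" half of the pretext task can be sketched in a few lines (sampling half of each cloud is an assumed recipe; the disentangling decoder is omitted):

```python
import torch

def mix(cloud_a, cloud_b):                   # each (N, 3)
    """Combine two clouds into one hybrid sample; the pretext task is to
    separate them again, so we also return per-point source labels."""
    n = cloud_a.shape[0] // 2
    idx_a = torch.randperm(cloud_a.shape[0])[:n]
    idx_b = torch.randperm(cloud_b.shape[0])[:n]
    mixed = torch.cat([cloud_a[idx_a], cloud_b[idx_b]], dim=0)
    labels = torch.cat([torch.zeros(n), torch.ones(n)])  # source of each point
    return mixed, labels

mixed, labels = mix(torch.randn(1024, 3), torch.randn(1024, 3))
```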
arXiv Detail & Related papers (2021-09-01T15:52:18Z)
- SSPU-Net: Self-Supervised Point Cloud Upsampling via Differentiable Rendering [21.563862632172363]
We propose a self-supervised point cloud upsampling network (SSPU-Net) to generate dense point clouds without using ground truth.
To achieve this, we exploit the consistency between the input sparse point cloud and the generated dense point cloud, in both their shapes and their rendered images.
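The shape-consistency half of such an objective is commonly a Chamfer distance; a minimal sketch (the rendered-image consistency term and the upsampling network itself are omitted):

```python
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between clouds a (N, 3) and b (M, 3):
    each point is matched to its nearest neighbor in the other cloud."""
    d = torch.cdist(a, b)                    # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

sparse = torch.randn(256, 3)
dense = torch.randn(1024, 3)                 # stand-in for the network output
loss = chamfer_distance(dense, sparse)
```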
arXiv Detail & Related papers (2021-08-01T13:26:01Z)