Learning 3D Representations from 2D Pre-trained Models via
Image-to-Point Masked Autoencoders
- URL: http://arxiv.org/abs/2212.06785v1
- Date: Tue, 13 Dec 2022 17:59:20 GMT
- Title: Learning 3D Representations from 2D Pre-trained Models via
Image-to-Point Masked Autoencoders
- Authors: Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, Hongsheng Li
- Abstract summary: We propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE.
By self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transfer capability.
- Score: 52.91248611338202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training on abundant image data has become the de facto standard
for learning robust 2D representations. In contrast, due to expensive data
acquisition and annotation, the paucity of large-scale 3D datasets severely
hinders the learning of high-quality 3D features. In this paper, we propose an
alternative: obtaining superior 3D representations from 2D pre-trained models
via Image-to-Point Masked Autoencoders, named I2P-MAE. During self-supervised
pre-training, we leverage the well-learned 2D knowledge to guide 3D masked
autoencoding, which reconstructs the masked point tokens with an
encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D
models to extract multi-view visual features of the input point cloud, and
then conduct two types of image-to-point learning schemes on top. For one, we
introduce a 2D-guided masking strategy that keeps semantically important point
tokens visible to the encoder. Compared to random masking, the network can
better concentrate on significant 3D structures and recover the masked tokens
from key spatial cues. For another, we enforce these visible tokens to
reconstruct the corresponding multi-view 2D features after the decoder. This
enables the network to effectively inherit high-level 2D semantics learned
from rich image data for discriminative 3D modeling. Aided by our
image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning,
achieves 93.4% accuracy with a linear SVM on ModelNet40, competitive with the
fully trained results of existing methods. By further fine-tuning on
ScanObjectNN's hardest split, I2P-MAE attains state-of-the-art 90.11%
accuracy, +3.68% over the second-best, demonstrating superior transfer
capability. Code will be available at https://github.com/ZrrSkywalker/I2P-MAE.
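To make the 2D-guided masking idea above concrete, the snippet below is a minimal NumPy sketch, not the authors' implementation. It assumes a per-token saliency score has already been aggregated from the projected multi-view 2D features (how those scores are obtained is described in the paper and not reproduced here), and it biases the sampling so that salient point tokens tend to stay visible to the encoder while the bulk of tokens are masked.

```python
import numpy as np

def saliency_guided_mask(token_saliency, mask_ratio=0.8, seed=0):
    """Return a boolean mask over point tokens (True = masked).

    token_saliency : (N,) non-negative per-token importance, assumed to be
                     aggregated from projected multi-view 2D features.
    mask_ratio     : fraction of tokens hidden from the encoder.
    """
    rng = np.random.default_rng(seed)
    n_tokens = token_saliency.shape[0]
    n_masked = int(round(mask_ratio * n_tokens))

    # Salient tokens get a LOWER probability of being masked, so the encoder
    # keeps seeing the semantically important 3D structures.
    p_keep = token_saliency / token_saliency.sum()
    p_mask = 1.0 - p_keep
    p_mask /= p_mask.sum()

    masked_idx = rng.choice(n_tokens, size=n_masked, replace=False, p=p_mask)
    mask = np.zeros(n_tokens, dtype=bool)
    mask[masked_idx] = True
    return mask

# Toy example: 512 point tokens with random saliency scores.
scores = np.random.default_rng(1).random(512)
mask = saliency_guided_mask(scores, mask_ratio=0.8)
print(int(mask.sum()), "of", mask.size, "tokens masked")
```

In the full method, the visible tokens are additionally trained to reconstruct their corresponding multi-view 2D features after the decoder; that reconstruction head is omitted from this sketch.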
Related papers
- NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields [57.617972778377215]
We show how to generate effective 3D representations from posed RGB images.
We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images.
Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks.
arXiv Detail & Related papers (2024-04-01T17:59:55Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundation models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
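As a rough illustration of such voting-based fusion (a sketch under assumed inputs, not this paper's code): suppose each 2D model's mask predictions have already been projected onto the point cloud, so every model provides a class id per point (or a sentinel where the point was not visible); a per-point majority vote then gives the fused 3D pseudo label.

```python
import numpy as np

def fuse_labels_by_voting(predictions, num_classes, ignore_label=-1):
    """Majority-vote fusion of per-point labels from multiple 2D models.

    predictions : (M, N) integer array, M models x N points; entries equal to
                  `ignore_label` mark points a model could not label.
    Returns a (N,) array of fused pseudo labels (`ignore_label` where no
    model produced a prediction).
    """
    num_models, num_points = predictions.shape
    fused = np.full(num_points, ignore_label, dtype=np.int64)

    for i in range(num_points):
        votes = predictions[:, i]
        votes = votes[votes != ignore_label]
        if votes.size == 0:
            continue  # no model labeled this point
        counts = np.bincount(votes, minlength=num_classes)
        fused[i] = counts.argmax()
    return fused

# Toy example: 3 models, 5 points, 4 classes (-1 = not visible to that model).
preds = np.array([[0, 1, 2, -1, 3],
                  [0, 1, 1, -1, 3],
                  [2, 1, 2, -1, -1]])
print(fuse_labels_by_voting(preds, num_classes=4))  # -> [ 0  1  2 -1  3]
```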
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Multi-View Representation is What You Need for Point-Cloud Pre-Training [22.55455166875263]
This paper proposes a novel approach to point-cloud pre-training that learns 3D representations by leveraging pre-trained 2D networks.
We train the 3D feature extraction network with the help of the novel 2D knowledge transfer loss.
Experimental results demonstrate that our pre-trained model can be successfully transferred to various downstream tasks.
arXiv Detail & Related papers (2023-06-05T03:14:54Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training [56.81809311892475]
Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers.
We propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds.
arXiv Detail & Related papers (2022-05-28T11:22:53Z)
- Data Efficient 3D Learner via Knowledge Transferred from 2D Model [30.077342050473515]
We deal with the data scarcity challenge of 3D tasks by transferring knowledge from strong 2D models via RGB-D images.
We utilize a strong and well-trained semantic segmentation model for 2D images to augment RGB-D images with pseudo-labels.
Our method already outperforms existing state-of-the-art methods tailored for 3D label efficiency.
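A minimal sketch of this kind of pseudo-labeling, under an assumed pinhole-camera setup rather than the paper's exact pipeline: per-pixel labels predicted by the 2D segmentation model are lifted onto the 3D points of the RGB-D frame through the depth map and the camera intrinsics.

```python
import numpy as np

def lift_pseudo_labels(depth, labels_2d, K):
    """Lift 2D pseudo labels onto 3D points using depth and intrinsics K.

    depth     : (H, W) depth map in meters (0 where invalid).
    labels_2d : (H, W) per-pixel class ids from a 2D segmentation model.
    K         : (3, 3) pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    Returns (points, labels): (M, 3) camera-frame points and their (M,) labels.
    """
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    valid = depth > 0  # drop pixels with missing depth

    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)  # (M, 3) points in the camera frame
    labels = labels_2d[valid]             # (M,) pseudo labels for those points
    return points, labels

# Toy example: a 4x4 frame with constant depth and two label regions.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.ones((4, 4))
labels_2d = np.zeros((4, 4), dtype=np.int64)
labels_2d[:, 2:] = 1
pts, lbls = lift_pseudo_labels(depth, labels_2d, K)
print(pts.shape, np.bincount(lbls))  # (16, 3) [8 8]
```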
arXiv Detail & Related papers (2022-03-16T09:14:44Z)
- Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets.
Our experiments show that the 3D models pretrained with 2D knowledge boost the performances across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
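To make the distillation step concrete, here is a small sketch using an assumed L2 feature-imitation loss, not necessarily the paper's exact formulation: the 2D branch produces "simulated 3D" features per point and is supervised to match the frozen, pretrained 3D network's features at the same points.

```python
import numpy as np

def feature_imitation_loss(feat_2d, feat_3d):
    """L2 imitation loss between simulated-3D features predicted by the 2D
    branch and target features from a frozen, pretrained 3D network.

    feat_2d : (N, C) per-point features produced from the 2D image branch
              (e.g. by back-projecting pixel features onto the points).
    feat_3d : (N, C) target features from the pretrained 3D network.
    """
    # Normalize both sides so the loss compares feature directions rather
    # than magnitudes (an assumed design choice for this sketch).
    f2 = feat_2d / (np.linalg.norm(feat_2d, axis=1, keepdims=True) + 1e-8)
    f3 = feat_3d / (np.linalg.norm(feat_3d, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum((f2 - f3) ** 2, axis=1)))

# Toy example: 1024 points with 64-dimensional features.
rng = np.random.default_rng(0)
student = rng.normal(size=(1024, 64))
teacher = rng.normal(size=(1024, 64))
print(feature_imitation_loss(student, teacher))  # large for unrelated features
print(feature_imitation_loss(teacher, teacher))  # 0.0 when they match exactly
```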