Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud
Pre-training
- URL: http://arxiv.org/abs/2205.14401v1
- Date: Sat, 28 May 2022 11:22:53 GMT
- Title: Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud
Pre-training
- Authors: Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang,
Yu Qiao, Hongsheng Li
- Abstract summary: Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers.
We propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds.
- Score: 56.81809311892475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Autoencoders (MAE) have shown great potential in self-supervised
pre-training for language and 2D image transformers. However, it remains an
open question how to exploit masked autoencoding for learning 3D
representations of irregular point clouds. In this paper, we propose
Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical
self-supervised learning of 3D point clouds. Unlike the standard transformer in
MAE, we modify the encoder and decoder into pyramid architectures to
progressively model spatial geometries and capture both fine-grained and
high-level semantics of 3D shapes. For the encoder that downsamples point
tokens by stages, we design a multi-scale masking strategy to generate
consistent visible regions across scales, and adopt a local spatial
self-attention mechanism to focus on neighboring patterns. By multi-scale token
propagation, the lightweight decoder gradually upsamples point tokens with
complementary skip connections from the encoder, which further promotes the
reconstruction from a global-to-local perspective. Extensive experiments
demonstrate the state-of-the-art performance of Point-M2AE for 3D
representation learning. With a frozen encoder after pre-training, Point-M2AE
achieves 92.9% accuracy with a linear SVM on ModelNet40, even surpassing some
fully trained methods. By fine-tuning on downstream tasks, Point-M2AE achieves
86.43% accuracy on ScanObjectNN, +3.36% over the second-best, and largely
benefits the few-shot classification, part segmentation and 3D object detection
with the hierarchical pre-training scheme. Code will be available at
https://github.com/ZrrSkywalker/Point-M2AE.
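As a rough illustration of the multi-scale masking idea described in the abstract, the sketch below samples a random visible set at the coarsest scale and back-projects it to the finer scales through nearest-neighbor assignment, so visible regions stay consistent across the pyramid. This is a minimal NumPy sketch under assumed details (an FPS-built pyramid, hypothetical helper names such as `farthest_point_sample` and `multi_scale_mask`), not the released Point-M2AE implementation.

```python
import numpy as np

def farthest_point_sample(xyz, n_samples):
    """Greedy farthest point sampling; returns n_samples indices into xyz (N, 3)."""
    idx = [np.random.randint(len(xyz))]
    dist = np.full(len(xyz), np.inf)
    for _ in range(n_samples - 1):
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[idx[-1]], axis=1))
        idx.append(int(dist.argmax()))
    return np.asarray(idx)

def multi_scale_mask(xyz, scale_sizes=(512, 256, 64), mask_ratio=0.8):
    """Sample a random mask at the coarsest scale and back-project it to the
    finer scales, so the visible regions stay consistent across the pyramid."""
    # Point pyramid: each scale is an FPS subsample (coarsest scale is last).
    scales = [xyz[farthest_point_sample(xyz, s)] for s in scale_sizes]
    coarse = scales[-1]
    # Keep (1 - mask_ratio) of the coarsest tokens visible, chosen at random.
    n_vis = max(1, int(round(len(coarse) * (1.0 - mask_ratio))))
    vis_coarse = np.zeros(len(coarse), dtype=bool)
    vis_coarse[np.random.choice(len(coarse), n_vis, replace=False)] = True
    # A finer token is visible iff its nearest coarsest center is visible.
    vis_per_scale = []
    for pts in scales:
        d2 = ((pts[:, None, :] - coarse[None, :, :]) ** 2).sum(-1)  # (S, C)
        vis_per_scale.append(vis_coarse[d2.argmin(axis=1)])
    return scales, vis_per_scale

# Example: mask a random 2048-point cloud across three scales.
cloud = np.random.rand(2048, 3).astype(np.float32)
scales, visible = multi_scale_mask(cloud)
for pts, vis in zip(scales, visible):
    print(pts.shape, f"visible tokens: {int(vis.sum())}/{len(vis)}")
```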
Related papers
- Triple Point Masking [49.39218611030084]
Existing 3D masked learning methods encounter performance bottlenecks under limited data.
We introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders.
Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.
arXiv Detail & Related papers (2024-09-26T05:33:30Z)
- Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder [21.73287941143304]
Multi-modality Masked Autoencoder (MAE) methods leverage both 2D images and 3D point clouds for pre-training.
We introduce a novel approach employing a 3D to multi-view masked autoencoder to fully harness the multi-modal attributes of 3D point clouds.
Our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks.
arXiv Detail & Related papers (2023-11-17T22:10:03Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders [52.91248611338202]
We propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named as I2P-MAE.
By self-supervised pre-training, we leverage the well learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transfer capability.
arXiv Detail & Related papers (2022-12-13T17:59:20Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications.
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- Masked Autoencoders in 3D Point Cloud Representation Learning [7.617783375837524]
We propose Masked Autoencoders in 3D point cloud representation learning (abbreviated as MAE3D).
We first split the input point cloud into patches and mask a portion of them, then use our Patch Embedding Module to extract the features of the unmasked patches.
Comprehensive experiments demonstrate that the local features extracted by our MAE3D from point cloud patches are beneficial for downstream classification tasks (a minimal sketch of this patchify-and-mask recipe follows the list below).
arXiv Detail & Related papers (2022-07-04T16:13:27Z)
- Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation [61.98690211671168]
We propose a Multi-level Attention-Decoder Network (MAED) to model multi-level attentions in a unified framework.
With the 3DPW training set, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm in PA-MPJPE.
arXiv Detail & Related papers (2021-09-06T09:06:17Z)
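The MAE3D entry above describes the common single-scale point-MAE recipe: split the cloud into local patches, mask most of them, and embed only the visible ones. Below is a minimal NumPy sketch of that recipe under assumed details; `fps`, `knn_patches`, and `mask_and_embed` are hypothetical names, and the toy shared-MLP-plus-max-pool embedding merely stands in for MAE3D's learned Patch Embedding Module.

```python
import numpy as np

def fps(xyz, m):
    """Greedy farthest point sampling; returns m indices into xyz (N, 3)."""
    idx = [np.random.randint(len(xyz))]
    dist = np.full(len(xyz), np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[idx[-1]], axis=1))
        idx.append(int(dist.argmax()))
    return np.asarray(idx)

def knn_patches(xyz, n_patches=64, k=32):
    """Group a cloud (N, 3) into local patches: FPS centers + k nearest neighbors."""
    centers = xyz[fps(xyz, n_patches)]
    d2 = ((centers[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)  # (P, N)
    nn = np.argsort(d2, axis=1)[:, :k]                           # (P, k)
    return xyz[nn] - centers[:, None, :], centers  # center-normalized patches

def mask_and_embed(patches, mask_ratio=0.6, dim=128):
    """Randomly mask patches, then embed only the visible ones with a toy
    shared MLP + max-pool (a stand-in for a learned patch embedding)."""
    p = len(patches)
    visible = np.ones(p, dtype=bool)
    visible[np.random.choice(p, int(p * mask_ratio), replace=False)] = False
    w = np.random.normal(scale=0.02, size=(patches.shape[-1], dim))  # toy weights
    tokens = np.maximum(patches[visible] @ w, 0.0).max(axis=1)       # (n_vis, dim)
    return tokens, visible

# Example: patchify, mask, and embed a random 1024-point cloud.
cloud = np.random.rand(1024, 3).astype(np.float32)
patches, centers = knn_patches(cloud)
tokens, visible = mask_and_embed(patches)
print(tokens.shape, f"visible patches: {int(visible.sum())}/{len(visible)}")
```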
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.