Masked Autoencoders for Point Cloud Self-supervised Learning
- URL: http://arxiv.org/abs/2203.06604v1
- Date: Sun, 13 Mar 2022 09:23:39 GMT
- Title: Masked Autoencoders for Point Cloud Self-supervised Learning
- Authors: Yatian Pang, Wenxiao Wang, Francis E.H. Tay, Wei Liu, Yonghong Tian,
Li Yuan
- Abstract summary: We propose a neat scheme of masked autoencoders for point cloud self-supervised learning.
We divide the input point cloud into irregular point patches and randomly mask them at a high ratio.
A standard Transformer-based autoencoder, with an asymmetric design and a shifting-mask-tokens operation, learns high-level latent features from unmasked point patches.
- Score: 27.894216954216716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a promising scheme of self-supervised learning, masked autoencoding has
significantly advanced natural language processing and computer vision.
Inspired by this, we propose a neat scheme of masked autoencoders for point
cloud self-supervised learning, addressing the challenges posed by the
properties of point clouds, including leakage of location information and
uneven information density. Concretely, we divide the input point cloud into
irregular point patches and randomly mask them at a high ratio. Then a
standard Transformer-based autoencoder, with an asymmetric design and a
shifting-mask-tokens operation, learns high-level latent features from the
unmasked point patches, aiming to reconstruct the masked point patches. Extensive experiments
show that our approach is efficient during pre-training and generalizes well on
various downstream tasks. Specifically, our pre-trained models achieve 84.52%
accuracy on ScanObjectNN and 94.04% accuracy on ModelNet40, outperforming all
other self-supervised learning methods. We show that with our scheme, a simple
architecture based entirely on standard Transformers can surpass dedicated
Transformer models trained with supervised learning. Our approach also advances
state-of-the-art accuracies by 1.5%-2.3% on few-shot object classification.
Furthermore, our work demonstrates the feasibility of applying unified
architectures from language and images to point clouds.
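To make the scheme concrete, here is a minimal PyTorch sketch of the patching-and-masking step: farthest point sampling picks patch centers, k-NN gathers irregular point patches around them, and a high ratio of patches is masked at random. The helper names, patch counts, and the 0.6 mask ratio are illustrative assumptions, not the authors' released code.

```python
import torch

def farthest_point_sample(xyz: torch.Tensor, n_centers: int) -> torch.Tensor:
    """Iteratively pick n_centers well-spread points from xyz of shape (B, N, 3)."""
    B, N, _ = xyz.shape
    centers = torch.zeros(B, n_centers, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.randint(0, N, (B,), device=xyz.device)
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_centers):
        centers[:, i] = farthest
        d = ((xyz - xyz[batch, farthest].unsqueeze(1)) ** 2).sum(-1)
        dist = torch.minimum(dist, d)   # distance to the nearest chosen center
        farthest = dist.argmax(-1)      # next center: farthest remaining point
    return centers

def make_point_patches(xyz, n_patches=64, patch_size=32):
    """Divide a cloud into irregular patches: FPS centers + k-NN neighborhoods."""
    B = xyz.shape[0]
    batch = torch.arange(B, device=xyz.device).unsqueeze(1)
    center_idx = farthest_point_sample(xyz, n_patches)
    centers = xyz[batch, center_idx]                              # (B, G, 3)
    knn_idx = torch.cdist(centers, xyz).topk(patch_size, largest=False).indices
    patches = xyz[batch.unsqueeze(-1), knn_idx]                   # (B, G, k, 3)
    # express each patch in local coordinates relative to its center
    return patches - centers.unsqueeze(2), centers

def random_patch_mask(B, n_patches, mask_ratio=0.6, device="cpu"):
    """Mask a fixed, high ratio of patches per cloud; True marks a masked patch."""
    n_mask = int(n_patches * mask_ratio)
    idx = torch.rand(B, n_patches, device=device).topk(n_mask, dim=1).indices
    mask = torch.zeros(B, n_patches, dtype=torch.bool, device=device)
    return mask.scatter(1, idx, True)
```

The asymmetric part can be sketched the same way: only the visible patch tokens pass through the full encoder, while learnable mask tokens are shifted to the input of a lightweight decoder that regresses the points of the masked patches. Depths, widths, and head counts below are hypothetical defaults, not values taken from the paper.

```python
import torch
import torch.nn as nn

class AsymmetricPointMAE(nn.Module):
    """Sketch: the encoder sees only visible patches; mask tokens join at the decoder."""
    def __init__(self, dim=384, enc_depth=12, dec_depth=4, heads=6, k=32):
        super().__init__()
        def stack(depth):
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            return nn.TransformerEncoder(layer, depth)
        self.encoder, self.decoder = stack(enc_depth), stack(dec_depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, k * 3)  # regress k xyz points per masked patch
        self.k = k

    def forward(self, tokens, pos, mask):
        # tokens, pos: (B, G, dim); mask: (B, G) bool with equal True count per row
        B, G, D = tokens.shape
        n_mask = int(mask[0].sum())
        vis = (tokens + pos)[~mask].view(B, G - n_mask, D)
        latent = self.encoder(vis)          # heavy encoder, visible patches only
        # shift mask tokens to the decoder, carrying only positional information
        mask_tok = self.mask_token.expand(B, n_mask, D) + pos[mask].view(B, n_mask, D)
        dec = self.decoder(torch.cat([latent, mask_tok], dim=1))
        return self.head(dec[:, -n_mask:]).view(B, n_mask, self.k, 3)
```

Since the target is a point set, a Chamfer-style set distance between the predicted and ground-truth points of each masked patch is a natural reconstruction loss; keeping mask tokens out of the encoder is what makes pre-training at a high mask ratio cheap.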
Related papers
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- SeRP: Self-Supervised Representation Learning Using Perturbed Point Clouds [6.29475963948119]
SeRP consists of an encoder-decoder architecture that takes perturbed or corrupted point clouds as inputs.
We have used Transformers and PointNet-based Autoencoders.
arXiv Detail & Related papers (2022-09-13T15:22:36Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs of pre-training, a vanilla ViT-Base model achieves 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Masked Autoencoders in 3D Point Cloud Representation Learning [7.617783375837524]
We propose masked autoencoders for 3D point cloud representation learning (abbreviated as MAE3D).
We first split the input point cloud into patches and mask a portion of them, then use our Patch Embedding Module to extract the features of unmasked patches.
Comprehensive experiments demonstrate that the local features extracted by our MAE3D from point cloud patches are beneficial for downstream classification tasks.
arXiv Detail & Related papers (2022-07-04T16:13:27Z)
- Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds [2.8544513613730205]
Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and recently, point clouds.
We propose VoxelMAE, a simple masked autoencoding pretraining scheme designed for voxel representations.
Our method improves 3D object detection performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset.
arXiv Detail & Related papers (2022-07-01T16:31:45Z)
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training [56.81809311892475]
Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers.
We propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds.
arXiv Detail & Related papers (2022-05-28T11:22:53Z)
- Masked Discrimination for Self-Supervised Learning on Point Clouds [27.652157544218234]
Masked autoencoding has achieved great success for self-supervised learning in the image and language domains.
Standard backbones like PointNet are unable to properly handle the training versus testing distribution mismatch introduced by masking during training.
We bridge this gap by proposing a discriminative mask pretraining Transformer framework, MaskPoint, for point clouds.
arXiv Detail & Related papers (2022-03-21T17:57:34Z)
- Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [104.82953953453503]
We present Point-BERT, a new paradigm for learning Transformers that generalizes the concept of BERT to 3D point clouds.
Experiments demonstrate that the proposed BERT-style pre-training strategy significantly improves the performance of standard point cloud Transformers.
arXiv Detail & Related papers (2021-11-29T18:59:03Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. Coupling an asymmetric encoder-decoder design with a high masking ratio enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP).
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves 76.9% Top-1 accuracy with DeiT-S under linear evaluation, using only 300 epochs of pre-training.
arXiv Detail & Related papers (2021-06-10T11:05:18Z)