Bootstrapped Masked Autoencoders for Vision BERT Pretraining
- URL: http://arxiv.org/abs/2207.07116v1
- Date: Thu, 14 Jul 2022 17:59:58 GMT
- Title: Bootstrapped Masked Autoencoders for Vision BERT Pretraining
- Authors: Xiaoyi Dong and Jianmin Bao and Ting Zhang and Dongdong Chen and
Weiming Zhang and Lu Yuan and Dong Chen and Fang Wen and Nenghai Yu
- Abstract summary: BootMAE improves the original masked autoencoders (MAE) with two core designs.
1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that reduces the pressure on the encoder to memorize target-specific information during BERT pretraining.
- Score: 142.5285802605117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose bootstrapped masked autoencoders (BootMAE), a new approach for
vision BERT pretraining. BootMAE improves the original masked autoencoders
(MAE) with two core designs: 1) a momentum encoder that provides online features
as extra BERT prediction targets; 2) a target-aware decoder that reduces the
pressure on the encoder to memorize target-specific information in BERT
pretraining. The first design is motivated by the observation that using a
pretrained MAE to extract the features as the BERT prediction target for masked
tokens can achieve better pretraining performance. Therefore, we add a momentum
encoder in parallel with the original MAE encoder, which bootstraps the
pretraining performance by using its own representation as the BERT prediction
target. In the second design, we introduce target-specific information (e.g.,
pixel values of unmasked patches) from the encoder directly to the decoder to
reduce the pressure on the encoder to memorize the target-specific
information. Thus, the encoder focuses on semantic modeling, which is the goal
of BERT pretraining, and does not need to waste its capacity memorizing the
target-related information of unmasked tokens. Through
extensive experiments, our BootMAE achieves $84.2\%$ Top-1 accuracy on
ImageNet-1K with a ViT-B backbone, outperforming MAE by $+0.8\%$ with the same
number of pre-training epochs. BootMAE also obtains a $+1.0$ mIoU improvement on
semantic segmentation on ADE20K, and $+1.3$ box AP and $+1.4$ mask AP improvements
on object detection and segmentation on the COCO dataset. Code is released at
https://github.com/LightDXY/BootMAE.
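The abstract's two designs can be condensed into a small sketch. The PyTorch snippet below is a hypothetical illustration only, not the released implementation (see the repository above): the names (`TinyEncoder`, `ema_update`, `bootmae_style_losses`, `pixel_head`, `feature_head`), the dimensions, the 75% masking ratio, and the equal weighting of the two losses are all assumptions made for brevity.

```python
# Minimal sketch of the two BootMAE ideas described above, NOT the authors'
# implementation (see https://github.com/LightDXY/BootMAE for that).
# All module and variable names here are illustrative assumptions.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder: embeds patches and runs a small Transformer."""
    def __init__(self, patch_dim=768, dim=256, depth=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):              # patches: (B, N, patch_dim)
        return self.blocks(self.embed(patches))

def ema_update(momentum_enc, online_enc, m=0.999):
    """Design 1: the momentum encoder is an exponential moving average of the
    online encoder; its features serve as the extra prediction target."""
    with torch.no_grad():
        for p_m, p_o in zip(momentum_enc.parameters(), online_enc.parameters()):
            p_m.mul_(m).add_(p_o, alpha=1.0 - m)

def bootmae_style_losses(patches, mask, online_enc, momentum_enc,
                         pixel_head, feature_head):
    """Pixel-regression plus feature-regression losses on masked positions.
    `pixel_head` and `feature_head` stand in for the target-aware decoders."""
    # Zero out masked patches for brevity (real MAE drops them from the input).
    visible = patches * (~mask).unsqueeze(-1)
    z = online_enc(visible)                  # encode visible context

    # Design 2 (schematically): raw unmasked pixels are handed to the pixel
    # decoder directly, so the encoder output need not carry them.
    pixel_pred = pixel_head(torch.cat([z, visible], dim=-1))
    with torch.no_grad():
        target_feat = momentum_enc(patches)  # bootstrapped feature target
    feat_pred = feature_head(z)

    w = mask.unsqueeze(-1).float()           # compute losses on masked tokens only
    pixel_loss = ((pixel_pred - patches) ** 2 * w).sum() / w.sum()
    feat_loss = ((feat_pred - target_feat) ** 2 * w).sum() / w.sum()
    return pixel_loss + feat_loss

if __name__ == "__main__":
    B, N, D, H = 2, 16, 768, 256
    online, momentum = TinyEncoder(D, H), TinyEncoder(D, H)
    momentum.load_state_dict(online.state_dict())
    pixel_head = nn.Linear(H + D, D)         # predicts masked pixel values
    feature_head = nn.Linear(H, H)           # predicts momentum-encoder features
    patches = torch.randn(B, N, D)
    mask = torch.rand(B, N) < 0.75           # ~75% masking ratio, as in MAE
    loss = bootmae_style_losses(patches, mask, online, momentum,
                                pixel_head, feature_head)
    loss.backward()
    ema_update(momentum, online)
    print(float(loss))
```

The two points the sketch tries to capture are that the feature target comes from an EMA copy of the encoder, so the target improves as training progresses (the "bootstrapping"), and that the raw unmasked pixels reach the pixel decoder without having to pass through the encoder's representation.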
Related papers
- PCP-MAE: Learning to Predict Centers for Point Masked Autoencoders [57.31790812209751]
We show that when the centers of masked patches are fed directly to the decoder, without any information from the encoder, it still reconstructs well.
We propose a simple yet effective method, i.e., learning to Predict Centers for Point Masked AutoEncoders (PCP-MAE).
Our method is of high pre-training efficiency compared to other alternatives and achieves great improvement over Point-MAE.
arXiv Detail & Related papers (2024-08-16T13:53:53Z) - Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval [26.00149743478937]
Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems.
We propose a modification to the traditional MAE by replacing the decoder of a masked auto-encoder with a completely simplified Bag-of-Word prediction task.
Our proposed method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters.
arXiv Detail & Related papers (2024-01-20T15:02:33Z) - Regress Before Construct: Regress Autoencoder for Point Cloud
Self-supervised Learning [18.10704604275133]
Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning.
Our approach is efficient during pre-training and generalizes well on various downstream tasks.
arXiv Detail & Related papers (2023-09-25T17:23:33Z) - SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs of pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z) - Context Autoencoder for Self-Supervised Representation Learning [64.63908944426224]
We pretrain an encoder by making predictions in the encoded representation space.
The network is an encoder-regressor-decoder architecture.
We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks.
arXiv Detail & Related papers (2022-02-07T09:33:45Z) - PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [102.7922200135147]
This paper explores a better codebook for BERT pre-training of vision transformers.
By contrast, the discrete tokens in the NLP field are naturally highly semantic.
We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
arXiv Detail & Related papers (2021-11-24T18:59:58Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels (a minimal sketch of this masking step follows after this list).
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z) - EncoderMI: Membership Inference against Pre-trained Encoders in
Contrastive Learning [27.54202989524394]
We propose EncoderMI, the first membership inference method against image encoders pre-trained by contrastive learning.
We evaluate EncoderMI on image encoders pre-trained on multiple datasets by ourselves, as well as the Contrastive Language-Image Pre-training (CLIP) image encoder, which is pre-trained on 400 million (image, text) pairs collected from the Internet and released by OpenAI.
arXiv Detail & Related papers (2021-08-25T03:00:45Z)
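Several of the entries above, and BootMAE itself, build on the random patch masking described in the "Masked Autoencoders Are Scalable Vision Learners" entry. The NumPy sketch below shows only that masking step under assumed settings (a 224x224 input, 16x16 patches, 75% masking ratio); `patchify` and `random_masking` are illustrative names, not functions from any of the cited codebases.

```python
# Hypothetical sketch of MAE-style random patch masking; not taken from any
# released implementation. Shapes and the 75% ratio follow the MAE paper.
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Keep a random subset of patches; the rest become reconstruction targets."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    order = rng.permutation(n)                 # random shuffle of patch indices
    keep_idx, mask_idx = order[:n_keep], order[n_keep:]
    return patches[keep_idx], keep_idx, mask_idx

if __name__ == "__main__":
    img = np.random.rand(224, 224, 3).astype(np.float32)
    patches = patchify(img)                    # (196, 768) for 16x16 RGB patches
    visible, keep_idx, mask_idx = random_masking(patches)
    # Only the visible 25% of patches go through the encoder; the decoder must
    # reconstruct the pixels of the masked 75%.
    print(patches.shape, visible.shape, mask_idx.shape)
```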