Bootstrapped Masked Autoencoders for Vision BERT Pretraining
- URL: http://arxiv.org/abs/2207.07116v1
- Date: Thu, 14 Jul 2022 17:59:58 GMT
- Title: Bootstrapped Masked Autoencoders for Vision BERT Pretraining
- Authors: Xiaoyi Dong and Jianmin Bao and Ting Zhang and Dongdong Chen and
Weiming Zhang and Lu Yuan and Dong Chen and Fang Wen and Nenghai Yu
- Abstract summary: BootMAE improves the original masked autoencoders (MAE) with two core designs.
1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that reduces the pressure on the encoder to memorize target-specific information during BERT pretraining.
- Score: 142.5285802605117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose bootstrapped masked autoencoders (BootMAE), a new approach for
vision BERT pretraining. BootMAE improves the original masked autoencoders
(MAE) with two core designs: 1) a momentum encoder that provides online features
as extra BERT prediction targets; 2) a target-aware decoder that reduces the
pressure on the encoder to memorize target-specific information in BERT
pretraining. The first design is motivated by the observation that using a
pretrained MAE to extract the features as the BERT prediction target for masked
tokens can achieve better pretraining performance. Therefore, we add a momentum
encoder in parallel with the original MAE encoder, which bootstraps the
pretraining performance by using its own representation as the BERT prediction
target. In the second design, we introduce target-specific information (e.g.,
pixel values of unmasked patches) from the encoder directly to the decoder to
reduce the pressure on the encoder to memorize the target-specific
information. Thus, the encoder focuses on semantic modeling, which is the goal
of BERT pretraining, and does not need to waste its capacity memorizing the
target-related information of unmasked tokens. Through
extensive experiments, our BootMAE achieves $84.2\%$ Top-1 accuracy on
ImageNet-1K with a ViT-B backbone, outperforming MAE by $+0.8\%$ with the same
number of pre-training epochs. BootMAE also obtains a $+1.0$ mIoU improvement on
semantic segmentation on ADE20K, and $+1.3$ box AP and $+1.4$ mask AP improvements
on object detection and segmentation on the COCO dataset. Code is released at
https://github.com/LightDXY/BootMAE.
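The abstract's two designs can be condensed into a small sketch. The PyTorch snippet below is a hypothetical illustration only, not the released implementation (see the repository above): the names (`TinyEncoder`, `ema_update`, `bootmae_style_losses`, `pixel_head`, `feature_head`), the dimensions, the 75% masking ratio, and the equal weighting of the two losses are all assumptions made for brevity.

```python
# Minimal sketch of the two BootMAE ideas described above, NOT the authors'
# implementation (see https://github.com/LightDXY/BootMAE for that).
# All module and variable names here are illustrative assumptions.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder: embeds patches and runs a small Transformer."""
    def __init__(self, patch_dim=768, dim=256, depth=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):              # patches: (B, N, patch_dim)
        return self.blocks(self.embed(patches))

def ema_update(momentum_enc, online_enc, m=0.999):
    """Design 1: the momentum encoder is an exponential moving average of the
    online encoder; its features serve as the extra prediction target."""
    with torch.no_grad():
        for p_m, p_o in zip(momentum_enc.parameters(), online_enc.parameters()):
            p_m.mul_(m).add_(p_o, alpha=1.0 - m)

def bootmae_style_losses(patches, mask, online_enc, momentum_enc,
                         pixel_head, feature_head):
    """Pixel-regression plus feature-regression losses on masked positions.
    `pixel_head` and `feature_head` stand in for the target-aware decoders."""
    # Zero out masked patches for brevity (real MAE drops them from the input).
    visible = patches * (~mask).unsqueeze(-1)
    z = online_enc(visible)                  # encode visible context

    # Design 2 (schematically): raw unmasked pixels are handed to the pixel
    # decoder directly, so the encoder output need not carry them.
    pixel_pred = pixel_head(torch.cat([z, visible], dim=-1))
    with torch.no_grad():
        target_feat = momentum_enc(patches)  # bootstrapped feature target
    feat_pred = feature_head(z)

    w = mask.unsqueeze(-1).float()           # compute losses on masked tokens only
    pixel_loss = ((pixel_pred - patches) ** 2 * w).sum() / w.sum()
    feat_loss = ((feat_pred - target_feat) ** 2 * w).sum() / w.sum()
    return pixel_loss + feat_loss

if __name__ == "__main__":
    B, N, D, H = 2, 16, 768, 256
    online, momentum = TinyEncoder(D, H), TinyEncoder(D, H)
    momentum.load_state_dict(online.state_dict())
    pixel_head = nn.Linear(H + D, D)         # predicts masked pixel values
    feature_head = nn.Linear(H, H)           # predicts momentum-encoder features
    patches = torch.randn(B, N, D)
    mask = torch.rand(B, N) < 0.75           # ~75% masking ratio, as in MAE
    loss = bootmae_style_losses(patches, mask, online, momentum,
                                pixel_head, feature_head)
    loss.backward()
    ema_update(momentum, online)
    print(float(loss))
```

The two points the sketch tries to capture are that the feature target comes from an EMA copy of the encoder, so the target improves as training progresses (the "bootstrapping"), and that the raw unmasked pixels reach the pixel decoder without having to pass through the encoder's representation.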
Related papers
- PCP-MAE: Learning to Predict Centers for Point Masked Autoencoders [57.31790812209751]
We show that when the centers of masked patches are fed directly to the decoder, without any information from the encoder, it still reconstructs well.
We propose a simple yet effective method, i.e., learning to Predict Centers for Point Masked AutoEncoders (PCP-MAE).
Our method is of high pre-training efficiency compared to other alternatives and achieves great improvement over Point-MAE.
arXiv Detail & Related papers (2024-08-16T13:53:53Z) - Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval [26.00149743478937]
Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems.
We propose a modification to the traditional MAE by replacing the decoder of a masked auto-encoder with a completely simplified Bag-of-Word prediction task.
Our proposed method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters.
arXiv Detail & Related papers (2024-01-20T15:02:33Z) - Regress Before Construct: Regress Autoencoder for Point Cloud
Self-supervised Learning [18.10704604275133]
Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning.
Our approach is efficient during pre-training and generalizes well on various downstream tasks.
arXiv Detail & Related papers (2023-09-25T17:23:33Z) - SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs of pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z) - Context Autoencoder for Self-Supervised Representation Learning [64.63908944426224]
We pretrain an encoder by making predictions in the encoded representation space.
The network is an encoder-regressor-decoder architecture.
We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks.
arXiv Detail & Related papers (2022-02-07T09:33:45Z) - PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [102.7922200135147]
This paper explores a better codebook for BERT pre-training of vision transformers.
By contrast, the discrete tokens in the NLP field are naturally highly semantic.
We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
arXiv Detail & Related papers (2021-11-24T18:59:58Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels (a minimal sketch of this masking step follows after this list).
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z) - EncoderMI: Membership Inference against Pre-trained Encoders in
Contrastive Learning [27.54202989524394]
We propose EncoderMI, the first membership inference method against image encoders pre-trained by contrastive learning.
We evaluate EncoderMI on image encoders pre-trained on multiple datasets by ourselves, as well as the Contrastive Language-Image Pre-training (CLIP) image encoder, which is pre-trained on 400 million (image, text) pairs collected from the Internet and released by OpenAI.
arXiv Detail & Related papers (2021-08-25T03:00:45Z)
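Several of the entries above, and BootMAE itself, build on the random patch masking described in the "Masked Autoencoders Are Scalable Vision Learners" entry. The NumPy sketch below shows only that masking step under assumed settings (a 224x224 input, 16x16 patches, 75% masking ratio); `patchify` and `random_masking` are illustrative names, not functions from any of the cited codebases.

```python
# Hypothetical sketch of MAE-style random patch masking; not taken from any
# released implementation. Shapes and the 75% ratio follow the MAE paper.
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Keep a random subset of patches; the rest become reconstruction targets."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    order = rng.permutation(n)                 # random shuffle of patch indices
    keep_idx, mask_idx = order[:n_keep], order[n_keep:]
    return patches[keep_idx], keep_idx, mask_idx

if __name__ == "__main__":
    img = np.random.rand(224, 224, 3).astype(np.float32)
    patches = patchify(img)                    # (196, 768) for 16x16 RGB patches
    visible, keep_idx, mask_idx = random_masking(patches)
    # Only the visible 25% of patches go through the encoder; the decoder must
    # reconstruct the pixels of the masked 75%.
    print(patches.shape, visible.shape, mask_idx.shape)
```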