Rethinking Patch Dependence for Masked Autoencoders
- URL: http://arxiv.org/abs/2401.14391v1
- Date: Thu, 25 Jan 2024 18:49:57 GMT
- Title: Rethinking Patch Dependence for Masked Autoencoders
- Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam
Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg
- Abstract summary: We re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE)
We propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE)
- Score: 92.37365660775171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we re-examine inter-patch dependencies in the decoding
mechanism of masked autoencoders (MAE). We decompose this decoding mechanism
for masked patch reconstruction in MAE into self-attention and cross-attention.
Our investigations suggest that self-attention between mask patches is not
essential for learning good representations. To this end, we propose a novel
pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE).
CrossMAE's decoder leverages only cross-attention between masked and visible
tokens, with no degradation in downstream performance. This design also enables
decoding only a small subset of mask tokens, boosting efficiency. Furthermore,
each decoder block can now leverage different encoder features, resulting in
improved representation learning. CrossMAE matches MAE in performance with 2.5
to 3.7$\times$ less decoding compute. It also surpasses MAE on ImageNet
classification and COCO instance segmentation under the same compute. Code and
models: https://crossmae.github.io
Related papers
- Downstream Task Guided Masking Learning in Masked Autoencoders Using
Multi-Level Optimization [42.82742477950748]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that learns an optimal masking strategy during pretraining.
Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning.
arXiv Detail & Related papers (2024-02-28T07:37:26Z) - Regress Before Construct: Regress Autoencoder for Point Cloud
Self-supervised Learning [18.10704604275133]
Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning.
Our approach is efficient during pre-training and generalizes well on various downstream tasks.
arXiv Detail & Related papers (2023-09-25T17:23:33Z) - Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z) - SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
Self-distillated masked AutoEncoder network SdAE is proposed in this paper.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z) - Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.
arXiv Detail & Related papers (2022-07-13T17:59:55Z) - MAE-AST: Masked Autoencoding Audio Spectrogram Transformer [11.814012909512307]
We propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.
We leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens.
We find that MAE-like pretraining can provide a 3x speedup and 2x memory usage reduction over the vanilla SSAST.
arXiv Detail & Related papers (2022-03-30T22:06:13Z) - Context Autoencoder for Self-Supervised Representation Learning [64.63908944426224]
We pretrain an encoder by making predictions in the encoded representation space.
The network is an encoder-regressor-decoder architecture.
We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks.
arXiv Detail & Related papers (2022-02-07T09:33:45Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.