Good helper is around you: Attention-driven Masked Image Modeling
- URL: http://arxiv.org/abs/2211.15362v2
- Date: Thu, 1 Dec 2022 12:26:55 GMT
- Title: Good helper is around you: Attention-driven Masked Image Modeling
- Authors: Zhengqi Liu, Jie Gui, Hao Luo
- Abstract summary: Masked image modeling (MIM) has shown huge potential in self-supervised learning.
We propose the \textbf{Attention-driven Masking and Throwing Strategy} (AMT).
AMT improves the linear probing accuracy of MAE by $2.9\% \sim 5.9\%$ on CIFAR-10/100, STL-10, Tiny ImageNet, and ImageNet-1K.
- Score: 12.961634455083775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling (MIM) has shown huge potential for
self-supervised learning over the past year. Built on the now-universal vision
transformer backbone, MIM learns self-supervised visual representations by
masking a portion of an image's patches and attempting to recover the missing
pixels. Most previous works mask image patches randomly, which underutilizes
semantic information that is beneficial to visual representation learning.
Moreover, because of the large backbone, most previous works must also spend
considerable time on pre-training. In this paper, we propose the
\textbf{Attention-driven Masking and Throwing Strategy} (AMT), which addresses
both problems. We first leverage the self-attention mechanism to obtain the
semantic information of the image automatically during training, without any
supervised methods. The masking strategy is guided by that information to mask
areas selectively, which helps representation learning. Moreover, we propose a
redundant patch throwing strategy that makes learning more efficient. As a
plug-and-play module for masked image modeling, AMT improves the linear
probing accuracy of MAE by $2.9\% \sim 5.9\%$ on CIFAR-10/100, STL-10, Tiny
ImageNet, and ImageNet-1K, and also improves the fine-tuning accuracy of MAE
and SimMIM. The design further achieves superior performance on downstream
detection and segmentation tasks. Code is available at
https://github.com/guijiejie/AMT.
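
To make the strategy concrete, the sketch below shows one way attention-driven
masking and throwing could be implemented for ViT patch tokens. It is a minimal
sketch, not the paper's published procedure: scoring patches by mean
[CLS]-to-patch attention, the 0.75/0.1 ratios, and the choice to mask the
highest-attention patches while throwing the lowest are all illustrative
assumptions (see the repository above for the actual implementation).

import torch

def amt_style_masks(cls_attn, mask_ratio=0.75, throw_ratio=0.1):
    """Split patch tokens into keep / mask / throw sets from attention scores.

    cls_attn: (B, N) attention from the [CLS] token to each of the N patch
    tokens, averaged over heads. It comes from the encoder's own
    self-attention, so no supervised signal is needed.
    Returns three boolean tensors of shape (B, N).
    """
    B, N = cls_attn.shape
    n_throw = int(N * throw_ratio)   # least-attended patches: discarded outright
    n_mask = int(N * mask_ratio)     # most-attended patches: reconstruction targets
    assert n_throw + n_mask <= N

    order = cls_attn.argsort(dim=1)  # patch indices sorted by ascending attention
    throw = torch.zeros(B, N, dtype=torch.bool)
    mask = torch.zeros(B, N, dtype=torch.bool)
    throw.scatter_(1, order[:, :n_throw], True)    # redundant patches
    mask.scatter_(1, order[:, N - n_mask:], True)  # semantically rich patches
    keep = ~(mask | throw)           # visible patches fed to the encoder
    return keep, mask, throw

# Example: 196 tokens, i.e. a 14x14 grid of 16x16 patches from a 224x224 image.
scores = torch.rand(4, 196).softmax(dim=1)
keep, mask, throw = amt_style_masks(scores)

Throwing the low-attention tokens shortens the sequence the transformer has to
process, which is presumably where the pre-training speedup comes from; that
reading is an inference from the abstract, not a stated detail.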
Related papers
- Downstream Task Guided Masking Learning in Masked Autoencoders Using
Multi-Level Optimization [42.82742477950748]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that learns an optimal masking strategy during pretraining.
Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning.
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and
Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at both its inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
arXiv Detail & Related papers (2022-01-31T10:23:23Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
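
Since several of the papers above build on MAE's random masking, a minimal
sketch of that baseline is useful for contrast. The shuffle-by-noise trick
mirrors the public MAE code, but this is an illustrative reconstruction, not a
verbatim excerpt; the 75% mask ratio is the MAE default.

import torch

def random_masking(x, mask_ratio=0.75):
    """MAE-style random masking: keep a random subset of patch tokens.

    x: (B, N, D) patch embeddings. Returns the visible tokens, a binary
    mask over all N positions (1 = masked), and the indices that restore
    the original patch order after decoding.
    """
    B, N, D = x.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=x.device)  # one random score per patch
    shuffle = noise.argsort(dim=1)             # a random permutation per sample
    restore = shuffle.argsort(dim=1)           # its inverse permutation

    keep_idx = shuffle[:, :n_keep]             # the first n_keep are "visible"
    x_visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=x.device)   # 1 = masked, 0 = visible
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, restore)      # back to the original patch order
    return x_visible, mask, restore

Only the visible tokens pass through the encoder, which is what lets MAE train
large models efficiently; a lightweight decoder then reconstructs pixels at the
masked positions.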
This list is automatically generated from the titles and abstracts of the papers on this site.