Good helper is around you: Attention-driven Masked Image Modeling
- URL: http://arxiv.org/abs/2211.15362v2
- Date: Thu, 1 Dec 2022 12:26:55 GMT
- Title: Good helper is around you: Attention-driven Masked Image Modeling
- Authors: Zhengqi Liu, Jie Gui, Hao Luo
- Abstract summary: Masked image modeling (MIM) has shown huge potential in self-supervised learning.
We propose the Attention-driven Masking and Throwing Strategy (AMT).
AMT improves the linear probing accuracy of MAE by $2.9\% \sim 5.9\%$ on CIFAR-10/100, STL-10, Tiny ImageNet, and ImageNet-1K.
- Score: 12.961634455083775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling (MIM) has shown huge potential in self-supervised learning over the past year. Building on the vision transformer as a universal backbone, MIM learns self-supervised visual representations by masking a portion of image patches and attempting to recover the missing pixels. Most previous works mask image patches randomly, which underutilizes semantic information that is beneficial to visual representation learning. In addition, because the backbone is large, most previous works spend considerable time on pre-training. In this paper, we propose the \textbf{Attention-driven Masking and Throwing Strategy} (AMT), which addresses both problems. We first leverage the self-attention mechanism to obtain semantic information about the image automatically during training, without any supervision. This information then guides the masking strategy to mask areas selectively, which benefits representation learning. Moreover, we propose a strategy that throws away redundant patches, making learning more efficient. As a plug-and-play module for masked image modeling, AMT improves the linear probing accuracy of MAE by $2.9\% \sim 5.9\%$ on CIFAR-10/100, STL-10, Tiny ImageNet, and ImageNet-1K, and also improves the fine-tuning accuracy of MAE and SimMIM. Furthermore, the design achieves superior performance on downstream detection and segmentation tasks. Code is available at https://github.com/guijiejie/AMT.
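The abstract describes AMT only at a high level. Below is a minimal sketch of what attention-driven masking and throwing might look like, assuming per-patch attention weights from the [CLS] token are available; the function name, default ratios, and sampling scheme are illustrative rather than taken from the paper.

```python
import torch

def attention_guided_mask_and_throw(cls_attn, mask_ratio=0.6, throw_ratio=0.15):
    """Pick patches to mask and patches to throw, guided by attention.

    cls_attn: (B, N) attention weights from the [CLS] token to the N patches,
    averaged over heads; higher weight ~ more semantically relevant.
    Returns (mask_idx, throw_idx) index tensors. Sketch only: the paper's
    exact ratios and sampling scheme may differ.
    """
    B, N = cls_attn.shape
    n_throw = int(N * throw_ratio)
    n_mask = int(N * mask_ratio)

    # Throw the least-attended (most redundant) patches entirely, so they
    # are neither encoded nor reconstructed -- this is the efficiency gain.
    order = cls_attn.argsort(dim=1)          # ascending attention
    throw_idx = order[:, :n_throw]           # lowest-attention patches
    remaining = order[:, n_throw:]           # candidates for masking/keeping

    # Sample mask positions from the remainder with probability proportional
    # to attention, so semantically rich regions are masked more often.
    probs = torch.gather(cls_attn, 1, remaining)
    probs = probs / probs.sum(dim=1, keepdim=True)
    pick = torch.multinomial(probs, n_mask, replacement=False)
    mask_idx = torch.gather(remaining, 1, pick)
    return mask_idx, throw_idx
```

In an MAE-style pipeline, the thrown indices would be excluded from both the encoder input and the reconstruction target, which is where the pre-training speedup would come from.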
Related papers
- Evolved Hierarchical Masking for Self-Supervised Learning [49.77271430882176]
Existing Masked Image Modeling methods apply fixed mask patterns to guide self-supervised training.
This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning.
arXiv Detail & Related papers (2025-04-12T09:40:14Z)
- Efficient Vision-Language Pre-training by Cluster Masking [13.845233914223561]
We propose a simple strategy for masking image patches during visual-language contrastive learning.
We randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities.
This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context.
arXiv Detail & Related papers (2024-05-14T17:59:40Z)
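A minimal sketch of the cluster-masking strategy summarized in the entry above, assuming similarity is measured on raw per-patch pixel intensities as described; the seed count, cluster size, and helper name are hypothetical.

```python
import torch

def cluster_mask(images, patch_size=16, n_seeds=4, cluster_size=12):
    """Mask clusters of visually similar patches (sketch of the idea only).

    images: (B, C, H, W). Similarity is measured on raw pixel intensities,
    as the summary describes; the paper's clustering may differ. Returns a
    boolean mask of shape (B, N) with True = masked.
    """
    B, C, H, W = images.shape
    # Per-patch mean intensity as a crude visual descriptor: (B, N)
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    feat = patches.mean(dim=(1, 4, 5)).flatten(1)
    N = feat.shape[1]

    mask = torch.zeros(B, N, dtype=torch.bool)
    for b in range(B):
        for s in torch.randperm(N)[:n_seeds]:        # random cluster seeds
            # Mask the patches whose intensity is closest to the seed's.
            dist = (feat[b] - feat[b, s]).abs()
            mask[b, dist.topk(cluster_size, largest=False).indices] = True
    return mask
```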
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
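A minimal sketch of the mask-generator ingredient named in the entry above: Gumbel noise plus a straight-through top-k selection makes a hard patch mask differentiable, so it can be trained adversarially against the image-modeling loss. The module layout and sizes are illustrative; the paper's generator and objective are more involved.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Learnable per-patch mask via Gumbel noise + straight-through top-k.

    Sketch of the ingredient only: a linear head scores patch embeddings,
    Gumbel perturbation plus top-k yields a hard mask whose gradient flows
    back through the soft scores, so the generator can be trained
    adversarially against the reconstruction loss.
    """
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, patch_tokens, mask_ratio=0.75, tau=1.0):
        # patch_tokens: (B, N, dim)
        B, N, _ = patch_tokens.shape
        n_mask = int(N * mask_ratio)
        logits = self.score(patch_tokens).squeeze(-1)            # (B, N)
        gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
        noisy = (logits + gumbel) / tau
        soft = noisy.softmax(dim=-1)                             # differentiable
        # Straight-through: hard top-k mask forward, soft gradients backward.
        hard = torch.zeros_like(soft)
        hard.scatter_(1, noisy.topk(n_mask, dim=-1).indices, 1.0)
        return hard + soft - soft.detach()                       # 1 = masked
```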
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have been a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
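A minimal sketch of the EMA-teacher ingredient that RC-MAE adds to MAE, per the entry above; the momentum value and the consistency term in the trailing comment are illustrative, not the paper's exact formulation.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Exponential-moving-average teacher update (sketch; RC-MAE's exact
    momentum schedule may differ)."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

# Per training step (sketch): the student reconstructs the masked patches,
# and an extra consistency term pulls its prediction toward the EMA
# teacher's prediction on the same mask:
#   loss = mse(student_pred, pixels) + mse(student_pred, teacher_pred.detach())
```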
- Masked Frequency Modeling for Self-Supervised Visual Pre-Training [102.89756957704138]
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
arXiv Detail & Related papers (2022-06-15T17:58:30Z)
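A minimal sketch of the MFM corruption step described in the entry above; a random binary frequency mask is assumed here for simplicity, while the paper's mask may be structured (e.g. low- vs. high-frequency bands).

```python
import torch

def mask_frequencies(images, mask_ratio=0.5):
    """Corrupt an image by removing part of its frequency spectrum (sketch of
    MFM's corruption step). Returns the corrupted image and the keep-mask, so
    the removed frequencies can serve as the prediction target.
    """
    B, C, H, W = images.shape
    spec = torch.fft.fft2(images)                       # complex spectrum
    keep = (torch.rand(B, 1, H, W) > mask_ratio).to(images.dtype)
    corrupted = torch.fft.ifft2(spec * keep).real       # back to pixel space
    return corrupted, keep
```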
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
A masked image modeling (MIM) framework for self-supervised learning, ADIOS, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
arXiv Detail & Related papers (2022-01-31T10:23:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.