Revealing the Dark Secrets of Masked Image Modeling
- URL: http://arxiv.org/abs/2205.13543v2
- Date: Fri, 27 May 2022 15:12:37 GMT
- Title: Revealing the Dark Secrets of Masked Image Modeling
- Authors: Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, Yue Cao
- Abstract summary: Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remains unclear.
In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives: visualizations and experiments.
We find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers.
- Score: 25.221516344869805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked image modeling (MIM) as pre-training is shown to be effective for
numerous vision downstream tasks, but how and where MIM works remains unclear.
In this paper, we compare MIM with the long-dominant supervised pre-trained
models from two perspectives, visualizations and experiments, to uncover their
key representational differences. From the visualizations, we find that MIM
brings a locality inductive bias to all layers of the trained models, whereas
supervised models tend to focus locally at lower layers but more globally at
higher layers. This may be why MIM helps Vision Transformers, which have a very
large receptive field, to optimize. With MIM, the model maintains large
diversity across attention heads in all layers; for supervised models, that
diversity almost disappears in the last three layers, and the reduced diversity
harms fine-tuning performance. From the experiments, we find that MIM models
perform significantly better than their supervised counterparts on geometric
and motion tasks with weak semantics and on fine-grained classification tasks.
Without bells and whistles, a standard MIM pre-trained SwinV2-L achieves
state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and
78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on
KITTI), and video object tracking (70.7 SUC on LaSOT). On semantic
understanding datasets whose categories are sufficiently covered by the
supervised pre-training, MIM models still achieve highly competitive transfer
performance. With a deeper understanding of MIM, we hope our work can inspire
new and solid research in this direction.
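Both findings above lend themselves to simple diagnostics. Below is a minimal sketch, not the authors' released code: `avg_attention_distance` measures how local each head is, and `head_similarity` is one crude proxy for (lack of) head diversity. The tensor layout, the `h`/`w` grid assumption, and both function names are illustrative choices, not names from the paper.

```python
# Minimal sketch (not the authors' code) of two diagnostics in the spirit of
# the paper's visualizations: per-head attention distance and head diversity.
# Assumes `attn` holds softmax attention weights of shape (heads, N, N) for an
# h x w grid of patch tokens (no CLS token), e.g. captured with a forward hook
# on one transformer block.
import torch
import torch.nn.functional as F

def avg_attention_distance(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Mean attended distance per head, in patch units. Small => local head."""
    heads, n, _ = attn.shape
    assert n == h * w, "expects an h*w grid of patch tokens"
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(coords, coords)       # (N, N) pairwise patch distances
    # Expected distance under each head's attention, averaged over queries.
    return (attn * dist).sum(dim=-1).mean(dim=-1)                    # (heads,)

def head_similarity(attn: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between heads' attention maps.
    High similarity = low diversity (a crude stand-in, not the paper's metric)."""
    heads = attn.shape[0]
    flat = F.normalize(attn.flatten(1), dim=-1)   # (heads, N*N), unit norm
    sim = flat @ flat.T                           # (heads, heads)
    return (sim.sum() - heads) / (heads * (heads - 1))  # off-diagonal mean
```

Plotted per layer, these quantities echo the abstract's claims: MIM keeps small attended distances and low head similarity at every depth, while supervised models drift toward large distances and near-identical heads in the final layers.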
Related papers
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations [16.885965702357314]
MIM-Refiner is a contrastive learning boost for pre-trained MIM models.
We refine the features of MIM models from subpar to state-of-the-art off-the-shelf features.
arXiv Detail & Related papers (2024-02-15T16:46:16Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders [17.564722905991776]
We present a pipeline of Image to Vector (Img2Vec) for masked image modeling (MIM) with deep features.
Img2Vec is a simple yet effective framework tailored to deep feature MIM learning, accomplishing superb comprehensive performance on representative vision tasks.
arXiv Detail & Related papers (2023-04-25T03:01:37Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose directing effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder (a sketch of the generic reconstruction objective follows this list).
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z)
- CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representations by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z)
- MimCo: Masked Image Modeling Pre-training with Contrastive Teacher [14.413674270588023]
Masked image modeling (MIM) has received much attention in self-supervised learning (SSL).
Visualizations show that the learned representations are less separable, especially compared to those based on contrastive-learning pre-training.
We propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training.
arXiv Detail & Related papers (2022-09-07T10:59:05Z)
- Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to "learn by recovering missing contents".
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z)
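Most entries above are variations on a single reconstruction objective, differing in what is reconstructed (pixels, CLIP features, multi-scale targets) and where the supervision is applied. Below is a minimal sketch of that shared objective with a SimMIM/MAE-style masked-only loss; `encoder` and `decoder` are placeholder modules, and the masking scheme is an assumption rather than any specific paper's recipe.

```python
# Minimal sketch of the generic MIM objective shared by the papers above:
# mask a fraction of patch tokens, encode the corrupted sequence, and regress
# the masked content. `encoder`/`decoder` are placeholders, not any specific
# paper's modules.
import torch
import torch.nn.functional as F

def mim_loss(encoder, decoder, patches: torch.Tensor, mask_ratio: float = 0.6):
    """patches: (B, N, D) patch embeddings; returns the reconstruction loss."""
    B, N, D = patches.shape
    num_masked = int(mask_ratio * N)
    # Randomly choose which patches each sample must reconstruct.
    idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
    masked_idx = idx[:, :num_masked]                       # (B, num_masked)
    mask = torch.zeros(B, N, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, masked_idx, True)
    # Replace masked patches (zeros here for brevity; real recipes use a
    # shared learnable mask token).
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    latent = encoder(corrupted)                            # (B, N, D)
    pred = decoder(latent)                                 # (B, N, D)
    # Compute the loss only on masked positions.
    return F.l1_loss(pred[mask], patches[mask])
```

The entries above change the pieces of this template: CAE v2 swaps the regression target for CLIP features and studies the mask ratio and supervision position, while local multi-scale reconstruction attaches such losses at several encoder depths instead of only the top layer.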