SimMIM: A Simple Framework for Masked Image Modeling
- URL: http://arxiv.org/abs/2111.09886v1
- Date: Thu, 18 Nov 2021 18:59:45 GMT
- Title: SimMIM: A Simple Framework for Masked Image Modeling
- Authors: Zhenda Xie and Zheng Zhang and Yue Cao and Yutong Lin and Jianmin Bao
and Zhuliang Yao and Qi Dai and Han Hu
- Abstract summary: This paper presents SimMIM, a simple framework for masked image modeling.
We systematically study the major components of our framework and find that simple designs for each component yield very strong representation learning performance.
We also leverage this approach to facilitate the training of a 3B model which, using $40\times$ less data than in previous practice, achieves state-of-the-art results on four representative vision benchmarks.
- Score: 29.015777125540613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents SimMIM, a simple framework for masked image modeling. We
simplify recently proposed related approaches without special designs such as
block-wise masking and tokenization via discrete VAE or clustering. To study
what lets the masked image modeling task learn good representations, we
systematically study the major components of our framework and find that
simple designs for each component yield very strong representation
learning performance: 1) random masking of the input image with a moderately
large masked patch size (e.g., 32) makes a strong pretext task; 2) predicting
raw RGB pixel values by direct regression performs no worse than the patch
classification approaches with complex designs; 3) the prediction head can be
as light as a linear layer, with no worse performance than heavier ones. Using
ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by
pre-training on this same dataset, surpassing the previous best approach by +0.6%.
When applied on a larger model of about 650 million parameters, SwinV2-H, it
achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We
also leverage this approach to facilitate the training of a 3B model
(SwinV2-G); using $40\times$ less data than in previous practice, we achieve
state-of-the-art results on four representative vision benchmarks. The code
and models will be publicly available at https://github.com/microsoft/SimMIM.
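The three findings above translate directly into a very small training recipe. Below is a minimal PyTorch-style sketch of that recipe written for this summary, not taken from the released code: the names `SimMIMSketch` and `random_patch_mask`, the 0.6 mask ratio, and the assumption that the encoder consumes already-embedded patch tokens are all illustrative.

```python
# Illustrative sketch of the SimMIM recipe (not the official implementation).
# Assumes `encoder` is any ViT/Swin-style backbone over a sequence of embedded
# patch tokens; patch extraction and positional embeddings are omitted.
import torch
import torch.nn as nn


def random_patch_mask(batch_size, num_patches, mask_ratio=0.6):
    """Randomly mask a fraction of patches (the ratio here is illustrative)."""
    noise = torch.rand(batch_size, num_patches)
    ranks = noise.argsort(dim=1).argsort(dim=1)      # rank of each patch
    return ranks < int(num_patches * mask_ratio)     # True = masked


class SimMIMSketch(nn.Module):
    def __init__(self, encoder, embed_dim=768, patch_size=32, in_chans=3):
        super().__init__()
        self.encoder = encoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # The prediction head is just a linear layer mapping each token back to
        # the raw RGB pixels of its patch.
        self.head = nn.Linear(embed_dim, patch_size * patch_size * in_chans)

    def forward(self, patch_tokens, patch_pixels, mask):
        # patch_tokens: (B, N, C) embedded patches
        # patch_pixels: (B, N, P*P*3) ground-truth pixels per patch
        # mask:         (B, N) bool, True where the patch is masked
        mask_tokens = self.mask_token.expand(*patch_tokens.shape[:2], -1)
        x = torch.where(mask.unsqueeze(-1), mask_tokens, patch_tokens)
        x = self.encoder(x)                          # (B, N, C)
        pred = self.head(x)                          # (B, N, P*P*3)
        # L1 regression, computed on masked patches only.
        return (pred[mask] - patch_pixels[mask]).abs().mean()

# Usage (shapes only; a real run also needs a patch embedder and optimizer):
#   mask = random_patch_mask(batch_size, num_patches)
#   loss = model(patch_tokens, patch_pixels, mask); loss.backward()
```

One simplification worth flagging: in the paper the masking unit (masked patch size 32) can be coarser than the model's token patch size, whereas this sketch masks at the token level for brevity.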
Related papers
- Keypoint Aware Masked Image Modelling [0.34530027457862006]
KAMIM improves top-1 linear probing accuracy from 16.12% to 33.97% and fine-tuning accuracy from 76.78% to 77.3% on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs.
We also analyze the learned representations of a ViT-B trained with KAMIM and observe that they behave similarly to those from contrastive learning, with longer attention distances and homogeneous self-attention across layers.
arXiv Detail & Related papers (2024-07-18T19:41:46Z)
- Improve Supervised Representation Learning with Masked Image Modeling [30.30649867772395]
We propose a simple yet effective setup that can easily integrate masked image modeling into existing supervised training paradigms.
We show that, with minimal architectural changes and no inference overhead, this setup improves the quality of the learned representations.
arXiv Detail & Related papers (2023-12-01T22:03:25Z)
- Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z)
- A Unified View of Masked Image Modeling [117.79456335844439]
Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers.
We introduce a simple yet effective method, termed as MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves comparable or superior performance to state-of-the-art methods.
arXiv Detail & Related papers (2022-10-19T14:59:18Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- Core Risk Minimization using Salient ImageNet [53.616101711801484]
We introduce the Salient ImageNet dataset with more than 1 million soft masks localizing core and spurious features for all 1000 ImageNet classes.
Using this dataset, we first evaluate the reliance of several ImageNet-pretrained models (42 total) on spurious features.
Next, we introduce a new learning paradigm called Core Risk Minimization (CoRM) whose objective ensures that the model predicts a class using its core features.
arXiv Detail & Related papers (2022-03-28T01:53:34Z)
- Combined Scaling for Zero-shot Transfer Learning [146.0851484769142]
We present a combined scaling method - named BASIC - that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set.
This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%.
Our model also shows significant improvements in robustness benchmarks.
arXiv Detail & Related papers (2021-11-19T05:25:46Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling this with an asymmetric encoder-decoder design, in which the encoder operates only on the visible patches, enables us to train large models efficiently and effectively (see the sketch after this list).
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
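For contrast with the pixel-regression recipe sketched earlier, here is an equally hypothetical sketch of the MAE idea summarized in the last entry: the encoder runs only on the visible patches, and a lightweight decoder with mask tokens reconstructs the missing pixels. The names, dimensions, and the fixed per-sample masking ratio are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch of the MAE idea (not the official code): encode visible
# patches only, then decode with mask tokens to reconstruct missing pixels.
# Positional embeddings and patch (un)shuffling are omitted for brevity.
import torch
import torch.nn as nn


class MAESketch(nn.Module):
    def __init__(self, encoder, decoder, embed_dim=768, dec_dim=512,
                 patch_pixels=16 * 16 * 3):
        super().__init__()
        self.encoder = encoder                    # large ViT over visible tokens
        self.decoder = decoder                    # much smaller transformer
        self.enc_to_dec = nn.Linear(embed_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.head = nn.Linear(dec_dim, patch_pixels)

    def forward(self, patch_tokens, patch_pixels, mask):
        # patch_tokens: (B, N, C); mask: (B, N) bool, True = masked.
        # Assumes every sample masks the same number of patches (e.g., 75%).
        B, N, C = patch_tokens.shape
        num_visible = int((~mask[0]).sum())
        visible = patch_tokens[~mask].reshape(B, num_visible, C)
        encoded = self.encoder(visible)           # encoder never sees mask tokens
        # Scatter encoded tokens back and fill masked slots with the mask token.
        dec_in = self.mask_token.expand(B, N, -1).clone()
        dec_in[~mask] = self.enc_to_dec(encoded).reshape(-1, dec_in.shape[-1])
        pred = self.head(self.decoder(dec_in))    # (B, N, patch_pixels)
        # Loss is computed on masked patches only.
        return ((pred[mask] - patch_pixels[mask]) ** 2).mean()
```

The design choice that matters for scaling is that the heavy encoder never processes mask tokens; only the small decoder sees the full-length sequence.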
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.