TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
- URL: http://arxiv.org/abs/2301.01296v1
- Date: Tue, 3 Jan 2023 18:59:54 GMT
- Title: TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
- Authors: Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu
- Abstract summary: Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications cannot, or can only marginally, benefit from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
- Score: 31.16595289223858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling (MIM) performs strongly in pre-training large vision
Transformers (ViTs). However, small models that are critical for real-world
applications cannot, or can only marginally, benefit from this pre-training approach.
In this paper, we explore distillation techniques to transfer the success of
large MIM-based pre-trained models to smaller ones. We systematically study
different options in the distillation framework, including distilling targets,
losses, the input, network regularization, sequential distillation, etc., revealing
that: 1) Distilling token relations is more effective than CLS token- and
feature-based distillation; 2) Using an intermediate layer of the teacher network
as the target performs better than using the last layer when the depth of the
student mismatches that of the teacher; 3) Weak regularization is preferred,
among other findings. With these findings, we achieve significant fine-tuning
accuracy improvements over MIM pre-training from scratch on ImageNet-1K
classification, for the ViT-Tiny, ViT-Small, and ViT-Base models, with gains of
+4.2%, +2.4%, and +1.4%, respectively. Our TinyMIM model of base size achieves
52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE
baseline. Our TinyMIM
model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image
classification, which sets a new record for small vision models of the same
size and computation budget. This strong performance suggests an alternative
way for developing small vision Transformer models, that is, by exploring
better training methods rather than introducing inductive biases into
architectures as in most previous works. Code is available at
https://github.com/OliverRensu/TinyMIM.
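As a rough illustration of finding 1) above, the sketch below shows one way token-relation distillation could be implemented in PyTorch: the student matches the teacher's softmax-normalized token-to-token similarity maps (Q-K and V-V relations) with a KL loss, instead of matching CLS tokens or raw features. The function names, tensor shapes, and the choice of an intermediate teacher layer are illustrative assumptions, not the released TinyMIM code (see the repository above for the official implementation).

```python
# Illustrative sketch only; not the released TinyMIM implementation.
# Assumes teacher and student expose per-head query/key/value tensors of
# shape (batch, heads, tokens, head_dim) with matching head counts, and
# that the teacher tensors come from an intermediate layer when the
# student is shallower than the teacher.
import torch
import torch.nn.functional as F


def relation(x: torch.Tensor, y: torch.Tensor, log: bool = False) -> torch.Tensor:
    """Softmax-normalized scaled dot-product similarity between token sets."""
    sim = (x @ y.transpose(-2, -1)) * x.shape[-1] ** -0.5
    return F.log_softmax(sim, dim=-1) if log else F.softmax(sim, dim=-1)


def relation_distill_loss(student_qkv, teacher_qkv) -> torch.Tensor:
    """KL divergence between student and teacher Q-K and V-V token relations."""
    sq, sk, sv = student_qkv
    tq, tk, tv = teacher_qkv
    qk = F.kl_div(relation(sq, sk, log=True), relation(tq, tk), reduction="batchmean")
    vv = F.kl_div(relation(sv, sv, log=True), relation(tv, tv), reduction="batchmean")
    return qk + vv
```

Which teacher layer to target and how the relations are formed are among the options the paper's distillation study compares.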
Related papers
- ScaleKD: Strong Vision Transformers Could Be Excellent Teachers [15.446480934024652]
We present a simple and effective knowledge distillation method, called ScaleKD.
Our method can train student backbones spanning a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets.
When scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties.
arXiv Detail & Related papers (2024-11-11T08:25:21Z) - An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question whether the extremely simple, lightweight ViTs' fine-tuning performance can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical designs (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z) - Heterogeneous Generative Knowledge Distillation with Masked Image Modeling [33.95780732124864]
Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models.
We develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion.
Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models.
arXiv Detail & Related papers (2023-09-18T08:30:55Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders [17.564722905991776]
We present a pipeline of Image to Vector (Img2Vec) for masked image modeling (MIM) with deep features.
Img2Vec is a simple yet effective framework tailored to deep feature MIM learning, accomplishing superb comprehensive performance on representative vision tasks.
arXiv Detail & Related papers (2023-04-25T03:01:37Z) - CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z) - Core Risk Minimization using Salient ImageNet [53.616101711801484]
We introduce the Salient Imagenet dataset with more than 1 million soft masks localizing core and spurious features for all 1000 Imagenet classes.
Using this dataset, we first evaluate the reliance of several Imagenet pretrained models (42 total) on spurious features.
Next, we introduce a new learning paradigm called Core Risk Minimization (CoRM) whose objective ensures that the model predicts a class using its core features.
arXiv Detail & Related papers (2022-03-28T01:53:34Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
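The attention-distribution matching that MiniLM describes is conceptually close to the token-relation target above. Below is a minimal, assumption-laden sketch (not the official MiniLM code) of last-layer self-attention distillation: because only token-by-token distributions are matched, the student's hidden size can differ from the teacher's without any projection layer, which is part of what makes the approach flexible for the student.

```python
# Hedged sketch of MiniLM-style deep self-attention distillation; not the
# official implementation. q/k/v come from the LAST self-attention layer of
# each model, shaped (batch, heads, tokens, head_dim); head_dim may differ
# between teacher and student, but the head and token counts are assumed equal.
import torch
import torch.nn.functional as F


def attn_probs(q: torch.Tensor, k: torch.Tensor, log: bool = False) -> torch.Tensor:
    """Per-head attention (or value-relation) distribution over tokens."""
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    return F.log_softmax(scores, dim=-1) if log else F.softmax(scores, dim=-1)


def minilm_distill_loss(student_qkv, teacher_qkv) -> torch.Tensor:
    """KL between teacher and student attention distributions plus value relations."""
    sq, sk, sv = student_qkv
    tq, tk, tv = teacher_qkv
    attn_loss = F.kl_div(attn_probs(sq, sk, log=True), attn_probs(tq, tk),
                         reduction="batchmean")
    value_loss = F.kl_div(attn_probs(sv, sv, log=True), attn_probs(tv, tv),
                          reduction="batchmean")
    return attn_loss + value_loss
```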
This list is automatically generated from the titles and abstracts of the papers on this site.