Masked Autoencoders Enable Efficient Knowledge Distillers
- URL: http://arxiv.org/abs/2208.12256v1
- Date: Thu, 25 Aug 2022 17:58:59 GMT
- Title: Masked Autoencoders Enable Efficient Knowledge Distillers
- Authors: Yutong Bai, Zeyu Wang, Junfei Xiao, Chen Wei, Huiyu Wang, Alan Yuille,
Yuyin Zhou, Cihang Xie
- Abstract summary: This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders.
We minimize the distance between the intermediate feature map of the teacher model and that of the student model.
Our method can robustly distill knowledge from teacher models even with extremely high masking ratios.
- Score: 31.606287119666572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the potential of distilling knowledge from pre-trained
models, especially Masked Autoencoders. Our approach is simple: in addition to
optimizing the pixel reconstruction loss on masked inputs, we minimize the
distance between the intermediate feature map of the teacher model and that of
the student model. This design leads to a computationally efficient knowledge
distillation framework, given 1) only a small visible subset of patches is
used, and 2) the (cumbersome) teacher model only needs to be partially
executed, i.e., inputs are forward-propagated through only the first few layers to
obtain intermediate feature maps. Compared to directly distilling fine-tuned
models, distilling pre-trained models substantially improves downstream
performance. For example, by distilling the knowledge from an MAE pre-trained
ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy,
outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%.
More intriguingly, our method can robustly distill knowledge from teacher
models even with extremely high masking ratios: e.g., with 95% masking ratio
where merely TEN patches are visible during distillation, our ViT-B
competitively attains a top-1 ImageNet accuracy of 83.6%; surprisingly, it can
still secure 82.4% top-1 ImageNet accuracy by aggressively training with just
FOUR visible patches (98% masking ratio). The code and models are publicly
available at https://github.com/UCSC-VLAA/DMAE.
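The objective described above combines two terms: the standard MAE pixel-reconstruction loss on masked inputs and a distance between the intermediate feature maps of the (partially executed) teacher and the student. Below is a minimal, hedged PyTorch sketch of that idea; the names (`dmae_style_loss`, `teacher_prefix`, `proj`), the choice of plain MSE as the feature distance, and the single weighting coefficient are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Hedged sketch of the two-term objective described in the abstract.
# Names, the MSE feature distance, and the weighting are assumptions.
import torch
import torch.nn.functional as F

def dmae_style_loss(student, teacher_prefix, proj,
                    visible_patches, pixel_targets, mask, feat_weight=1.0):
    # The student (a masked autoencoder) sees only the small visible subset
    # of patches and returns a pixel reconstruction plus an intermediate feature map.
    recon, student_feat = student(visible_patches)

    # Standard MAE term: reconstruct pixels only at the masked positions.
    recon_loss = F.mse_loss(recon[mask], pixel_targets[mask])

    # The large teacher is only partially executed: its first few layers
    # run on the same visible patches, which keeps distillation cheap.
    with torch.no_grad():
        teacher_feat = teacher_prefix(visible_patches)

    # Align feature dimensions and minimize the feature-map distance.
    feat_loss = F.mse_loss(proj(student_feat), teacher_feat)
    return recon_loss + feat_weight * feat_loss
```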
Related papers
- ScaleKD: Strong Vision Transformers Could Be Excellent Teachers [15.446480934024652]
We present a simple and effective knowledge distillation method, called ScaleKD.
Our method can train student backbones that span a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets.
When scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties.
arXiv Detail & Related papers (2024-11-11T08:25:21Z)
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z)
- TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance [97.01406871579525]
We propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models.
We show that TinyCLIP can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance.
Our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet.
arXiv Detail & Related papers (2023-09-21T17:59:53Z)
- TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications benefit only marginally, if at all, from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z)
- A simple, efficient and scalable contrastive masked autoencoder for learning visual representations [21.440853288058452]
We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations.
Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models.
arXiv Detail & Related papers (2022-10-30T16:21:22Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively (a minimal masking sketch follows this entry).
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
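To make the masking step in the MAE entry above concrete, here is a minimal PyTorch sketch of random patch masking; the helper name `random_mask_patches`, the tensor shapes, and the 75% default ratio are illustrative assumptions rather than the MAE reference code.

```python
# Illustrative sketch of MAE-style random patch masking (not the reference code).
import torch

def random_mask_patches(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim) -> (visible patches, boolean mask)."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    # Random per-patch scores define a random permutation per image.
    noise = torch.rand(b, n, device=patches.device)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]

    # Gather the small visible subset the encoder will actually see.
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Boolean mask: True marks patches that were hidden and must be reconstructed.
    mask = torch.ones(b, n, device=patches.device)
    mask.scatter_(1, ids_keep, 0)
    return visible, mask.bool()
```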
- Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
- Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from off-the-shelf, powerful pre-trained models.
Our solution performs distillation by only driving the student model's predictions to be consistent with those of the teacher model (see the sketch after this entry).
We empirically find that such a simple distillation setting is extremely effective; for example, the top-1 accuracy of MobileNetV3-large and ResNet50-D on the ImageNet-1k validation set is significantly improved.
arXiv Detail & Related papers (2021-03-10T09:32:44Z)
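As a companion to the prediction-consistency idea summarized in the entry above, here is a hedged sketch of training a student only to match the teacher's soft predictions; the function name, the temperature parameter, and the absence of a ground-truth label term are illustrative assumptions, not that paper's exact recipe.

```python
# Hedged sketch of prediction-matching (soft-label) distillation.
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    t = temperature
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    student_log_prob = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between teacher and student predictive distributions;
    # the t*t factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_prob, teacher_prob, reduction="batchmean") * (t * t)
```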
- MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks [57.69809561405253]
We introduce a framework that is able to boost the vanilla ResNet-50 to 80%+ Top-1 accuracy on ImageNet without tricks.
Our method obtains 80.67% top-1 accuracy on ImageNet using a single crop-size of 224x224 with vanilla ResNet-50.
Our framework consistently improves the smaller ResNet-18 from 69.76% to 73.19%.
arXiv Detail & Related papers (2020-09-17T17:59:33Z)