The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
- URL: http://arxiv.org/abs/2302.10494v4
- Date: Fri, 27 Sep 2024 14:50:23 GMT
- Title: The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
- Authors: Seungwoo Son, Jegwang Ryu, Namhoon Lee, Jaeho Lee
- Abstract summary: In this paper, we develop a simple framework to reduce the supervision cost of ViT distillation.
By masking input tokens, one can skip the computations associated with the masked tokens without requiring any change to teacher parameters or architecture.
We find that masking patches with the lowest student attention scores is highly effective, saving up to 50% of teacher FLOPs without any drop in student accuracy.
- Score: 14.467509261354458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is an effective method for training lightweight vision models. However, acquiring teacher supervision for training samples is often costly, especially from large-scale models like vision transformers (ViTs). In this paper, we develop a simple framework to reduce the supervision cost of ViT distillation: masking out a fraction of input tokens given to the teacher. By masking input tokens, one can skip the computations associated with the masked tokens without requiring any change to teacher parameters or architecture. We find that masking patches with the lowest student attention scores is highly effective, saving up to 50% of teacher FLOPs without any drop in student accuracy, while other masking criteria lead to suboptimal efficiency gains. Through in-depth analyses, we reveal that the student-guided masking provides a good curriculum to the student, making teacher supervision easier to follow during the early stage and more challenging in the later stage.
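Concretely, the selection step amounts to ranking the teacher's input patches by the student's [CLS] attention and forwarding only the top fraction to the teacher. Below is a minimal PyTorch sketch of that step; the function and tensor names (`mask_teacher_inputs`, `student_cls_attn`, `keep_ratio`) and the tensor layout are illustrative assumptions, not the authors' released code.

```python
import torch

def mask_teacher_inputs(patch_tokens, student_cls_attn, keep_ratio=0.5):
    """Keep only the patches the student attends to most; drop the rest
    before running the teacher. Illustrative sketch, not the paper's code.

    patch_tokens:     (B, N, D) patch embeddings fed to the teacher
    student_cls_attn: (B, N) student [CLS]-to-patch attention scores
    keep_ratio:       fraction of patches passed to the teacher
    """
    B, N, D = patch_tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # Indices of the patches with the highest student attention scores.
    keep_idx = student_cls_attn.topk(num_keep, dim=1).indices        # (B, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)              # (B, num_keep, D)
    # Gather only the kept patches; the teacher never processes the masked
    # ones, so its FLOPs shrink roughly in proportion to keep_ratio.
    return patch_tokens.gather(dim=1, index=keep_idx)

# Usage sketch inside a distillation step (teacher/student assumed to be ViTs):
#   visible = mask_teacher_inputs(patches, student_cls_attn, keep_ratio=0.5)
#   with torch.no_grad():
#       teacher_logits = teacher(visible)
#   loss = kd_loss(student_logits, teacher_logits)
```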
Related papers
- Bootstrap Masked Visual Modeling via Hard Patches Mining [68.74750345823674]
Masked visual modeling has attracted much attention due to its promising potential in learning generalizable representations.
We argue that it is equally important for the model to stand in the shoes of a teacher to produce challenging problems by itself.
To empower the model as a teacher, we propose Hard Patches Mining (HPM), predicting patch-wise losses and subsequently determining where to mask.
arXiv Detail & Related papers (2023-12-21T10:27:52Z)
- Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners [102.20090188997301]
We explore how to obtain a model that combines the strengths of Contrastive Learning (CL) and Masked Image Modeling (MIM).
To better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy.
Experimental results show that Hybrid Distill achieves superior performance on different benchmarks.
arXiv Detail & Related papers (2023-06-28T02:19:35Z)
- Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique that incorporates the influence of distillation into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z)
- Hard Patches Mining for Masked Image Modeling [52.46714618641274]
Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations.
We propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training.
arXiv Detail & Related papers (2023-04-12T15:38:23Z)
- Supervised Masked Knowledge Distillation for Few-Shot Transformers [36.46755346410219]
We propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers.
Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens.
Our method, with its simple design, outperforms previous methods by a large margin and achieves a new state-of-the-art.
arXiv Detail & Related papers (2023-03-25T03:31:46Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), obtained by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- Exploring Target Representations for Masked Autoencoders [78.57196600585462]
We show that a careful choice of the target representation is unnecessary for learning good representations.
We propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher.
The proposed method, masked knowledge distillation with bootstrapped teachers (dBOT), outperforms previous self-supervised methods by nontrivial margins.
arXiv Detail & Related papers (2022-09-08T16:55:19Z)
- What to Hide from Your Students: Attention-Guided Masked Image Modeling [32.402567373491834]
We argue that image token masking is fundamentally different from token masking in text.
We introduce a novel masking strategy, called attention-guided masking (AttMask).
arXiv Detail & Related papers (2022-03-23T20:52:50Z)
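To make the contrast with the main paper concrete: AttMask hides the patches the network attends to most, so the self-supervised pretext task stays hard, whereas the paper above drops the patches the student attends to least, so the teacher does less work. A minimal sketch of the AttMask-style selection is given below; the function name, signature, and tensor layout are assumptions for illustration, not the paper's implementation.

```python
import torch

def attention_guided_mask(cls_attn: torch.Tensor, mask_ratio: float = 0.4) -> torch.Tensor:
    """Choose patches to hide from the model's own attention, in the spirit
    of AttMask: the most-attended patches are masked so reconstruction stays
    challenging. Illustrative sketch only.

    cls_attn:   (B, N) [CLS]-to-patch attention scores
    mask_ratio: fraction of patches to hide
    Returns a boolean mask of shape (B, N), True where a patch is hidden.
    """
    B, N = cls_attn.shape
    num_mask = max(1, int(N * mask_ratio))
    # The highest-attention patches are the ones to hide.
    masked_idx = cls_attn.topk(num_mask, dim=1).indices
    mask = torch.zeros_like(cls_attn, dtype=torch.bool)
    mask.scatter_(1, masked_idx, True)
    return mask
```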