Meta Knowledge Distillation
- URL: http://arxiv.org/abs/2202.07940v1
- Date: Wed, 16 Feb 2022 09:09:51 GMT
- Title: Meta Knowledge Distillation
- Authors: Jihao Liu and Boxiao Liu and Hongsheng Li and Yu Liu
- Abstract summary: We propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters.
With ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE that trains for 1,650 epochs.
- Score: 33.48131864248235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies pointed out that knowledge distillation (KD) suffers from two
degradation problems, the teacher-student gap and the incompatibility with
strong data augmentations, making it not applicable to training
state-of-the-art models, which are trained with advanced augmentations.
However, we observe that a key factor, i.e., the temperatures in the softmax
functions for generating probabilities of both the teacher and student models,
was mostly overlooked in previous methods. With properly tuned temperatures,
such degradation problems of KD can be much mitigated. However, instead of
relying on a naive grid search, which shows poor transferability, we propose
Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable
meta temperature parameters. The meta parameters are adaptively adjusted during
training according to the gradients of the learning objective. We validate that
MKD is robust to different dataset scales, different teacher/student
architectures, and different types of data augmentation. With MKD, we achieve
the best performance with popular ViT architectures among compared methods that
use only ImageNet-1K as training data, ranging from tiny to large models. With
ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE that
trains for 1,650 epochs.
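The abstract describes distillation with separate softmax temperatures for teacher and student, adjusted by gradients of the learning objective. The toy sketch below illustrates that idea in plain Python. It is not the authors' implementation. The bi-level setup (one student step on the distillation loss, then a cross-entropy check on the true label), the finite-difference gradient, and all function names are assumptions made for illustration only.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax; a larger temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, t_student, t_teacher):
    """KL(teacher || student), with a separate temperature for each model."""
    p = softmax(teacher_logits, t_teacher)
    q = softmax(student_logits, t_student)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def student_step(student_logits, teacher_logits, t_student, t_teacher, lr=1.0):
    """One gradient step on kd_loss; the toy 'student' is just its logit vector.

    The gradient of kd_loss w.r.t. the student logits is (q - p) / t_student.
    """
    p = softmax(teacher_logits, t_teacher)
    q = softmax(student_logits, t_student)
    return [z - lr * (qi - pi) / t_student
            for z, qi, pi in zip(student_logits, q, p)]

def cross_entropy(logits, label):
    """Standard cross-entropy of the (temperature 1) prediction against the label."""
    return -math.log(softmax(logits, 1.0)[label])

def meta_objective(temps, student_logits, teacher_logits, label):
    """Assumed learning objective: label loss of the student after one KD step."""
    t_student, t_teacher = temps
    updated = student_step(student_logits, teacher_logits, t_student, t_teacher)
    return cross_entropy(updated, label)

def meta_update(temps, student_logits, teacher_logits, label,
                meta_lr=0.1, eps=1e-4, min_temp=0.1):
    """Gradient step on both temperatures, using central finite differences."""
    grads = []
    for i in range(len(temps)):
        up, down = list(temps), list(temps)
        up[i] += eps
        down[i] -= eps
        grads.append((meta_objective(up, student_logits, teacher_logits, label)
                      - meta_objective(down, student_logits, teacher_logits, label))
                     / (2 * eps))
    # Clamp so the temperatures stay strictly positive.
    return [max(min_temp, t - meta_lr * g) for t, g in zip(temps, grads)]

if __name__ == "__main__":
    student = [0.0, 0.0, 0.0]   # hypothetical untrained student logits
    teacher = [4.0, 1.0, 0.0]   # hypothetical teacher logits
    temps = [1.0, 1.0]          # [student temperature, teacher temperature]
    for _ in range(5):
        temps = meta_update(temps, student, teacher, label=0)
    print("learned temperatures:", temps)
    print("objective:", meta_objective(temps, student, teacher, 0))
```

In a real training loop the finite differences would be replaced by automatic differentiation and the toy logit vectors by network outputs on mini-batches. The sketch only shows why a temperature can be treated as a learnable parameter rather than a grid-searched constant.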
Related papers
- Dynamic Temperature Scheduler for Knowledge Distillation [8.855130508913513]
Knowledge Distillation (KD) trains a smaller student model using a large, pre-trained teacher model.
Traditional methods use a fixed temperature throughout training, which is suboptimal.
We introduce Dynamic Temperature Scheduler (DTS), which adjusts temperature dynamically based on the cross-entropy loss gap between teacher and student.
arXiv Detail & Related papers (2025-11-14T16:03:22Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced to filter and weight teacher representations so that the student distills from task-relevant representations only.
Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the distribution of the student to that of the teacher in advance of distillation.
Experiments on seven benchmarks demonstrate that Warmup-Distill could provide a warmup student more suitable for distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z)
- ScaleKD: Strong Vision Transformers Could Be Excellent Teachers [15.446480934024652]
We present a simple and effective knowledge distillation method, called ScaleKD.
Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets.
When scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties.
arXiv Detail & Related papers (2024-11-11T08:25:21Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z)
- DisWOT: Student Architecture Search for Distillation WithOut Training [0.0]
We explore a novel training-free framework to search for the best student architectures for a given teacher.
Our work first shows empirically that the optimal model under vanilla training cannot be the winner in distillation.
Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces.
arXiv Detail & Related papers (2023-03-28T01:58:45Z)
- Online Hyperparameter Optimization for Class-Incremental Learning [99.70569355681174]
Class-incremental learning (CIL) aims to train a classification model while the number of classes increases phase-by-phase.
An inherent challenge of CIL is the stability-plasticity tradeoff, i.e., CIL models should keep stable to retain old knowledge and keep plastic to absorb new knowledge.
We propose an online learning method that can adaptively optimize the tradeoff without knowing the setting a priori.
arXiv Detail & Related papers (2023-01-11T17:58:51Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- LTD: Low Temperature Distillation for Robust Adversarial Training [1.3300217947936062]
Adversarial training has been widely used to enhance the robustness of neural network models against adversarial attacks.
Despite the popularity of neural network models, a significant gap exists between the natural and robust accuracy of these models.
We propose a novel method called Low Temperature Distillation (LTD) that generates soft labels using the modified knowledge distillation framework.
arXiv Detail & Related papers (2021-11-03T16:26:00Z)
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.