CES-KD: Curriculum-based Expert Selection for Guided Knowledge
Distillation
- URL: http://arxiv.org/abs/2209.07606v1
- Date: Thu, 15 Sep 2022 21:02:57 GMT
- Title: CES-KD: Curriculum-based Expert Selection for Guided Knowledge
Distillation
- Authors: Ibtihel Amara, Maryam Ziaeefard, Brett H. Meyer, Warren Gross and
James J. Clark
- Abstract summary: This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD)
CES-KD is built upon the hypothesis that a student network should be guided gradually using stratified teaching curriculum.
Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty in classifying the image.
- Score: 4.182345120164705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is an effective tool for compressing deep
classification models for edge devices. However, the performance of KD is
affected by the large capacity gap between the teacher and student networks.
Recent methods have resorted to a multiple teacher assistant (TA) setting for
KD, which sequentially decreases the size of the teacher model to relatively
bridge the size gap between these models. This paper proposes a new technique
called Curriculum Expert Selection for Knowledge Distillation (CES-KD) to
efficiently enhance the learning of a compact student under the capacity gap
problem. This technique is built upon the hypothesis that a student network
should be guided gradually using stratified teaching curriculum as it learns
easy (hard) data samples better and faster from a lower (higher) capacity
teacher network. Specifically, our method is a gradual TA-based KD technique
that selects a single teacher per input image based on a curriculum driven by
the difficulty in classifying the image. In this work, we empirically verify
our hypothesis and rigorously experiment with CIFAR-10, CIFAR-100, CINIC-10,
and ImageNet datasets and show improved accuracy on VGG-like models, ResNets,
and WideResNets architectures.
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Invariant Causal Knowledge Distillation in Neural Networks [6.24302896438145]
In this paper, we introduce Invariant Consistency Distillation (ICD), a novel methodology designed to enhance knowledge distillation.
ICD ensures that the student model's representations are both discriminative and invariant with respect to the teacher's outputs.
Our results on CIFAR-100 and ImageNet ILSVRC-2012 show that ICD outperforms traditional KD techniques and surpasses state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T14:53:35Z) - Adaptive Teaching with Shared Classifier for Knowledge Distillation [6.03477652126575]
Knowledge distillation (KD) is a technique used to transfer knowledge from a teacher network to a student network.
We propose adaptive teaching with a shared classifier (ATSC)
Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multiteacher scenarios.
arXiv Detail & Related papers (2024-06-12T08:51:08Z) - Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve the knowledge distillation (KD)
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - BD-KD: Balancing the Divergences for Online Knowledge Distillation [12.27903419909491]
We propose BD-KD: Balancing of Divergences for online Knowledge Distillation.
We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network.
We demonstrate that, by performing this balancing design at the level of the student distillation loss, we improve upon both performance accuracy and calibration of the compact student network.
arXiv Detail & Related papers (2022-12-25T22:27:32Z) - Knowledge Distillation with Representative Teacher Keys Based on
Attention Mechanism for Image Classification Model Compression [1.503974529275767]
knowledge distillation (KD) has been recognized as one of the effective method of model compression to decrease the model parameters.
Inspired by attention mechanism, we propose a novel KD method called representative teacher key (RTK)
Our proposed RTK can effectively improve the classification accuracy of the state-of-the-art attention-based KD method.
arXiv Detail & Related papers (2022-06-26T05:08:50Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation [70.92135839545314]
We propose the dynamic prior knowledge (DPK), which integrates part of teacher's features as the prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when the adversarial robustness can be transferred from a teacher model to a student model in Knowledge distillation (KD)
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z) - Boosting Light-Weight Depth Estimation Via Knowledge Distillation [21.93879961636064]
We propose a lightweight network that can accurately estimate depth maps using minimal computing resources.
We achieve this by designing a compact model architecture that maximally reduces model complexity.
Our method achieves comparable performance to state-of-the-art methods while using only 1% of their parameters.
arXiv Detail & Related papers (2021-05-13T08:42:42Z) - Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.