Knowledge Distillation with Representative Teacher Keys Based on
Attention Mechanism for Image Classification Model Compression
- URL: http://arxiv.org/abs/2206.12788v1
- Date: Sun, 26 Jun 2022 05:08:50 GMT
- Title: Knowledge Distillation with Representative Teacher Keys Based on
Attention Mechanism for Image Classification Model Compression
- Authors: Jun-Teng Yang, Sheng-Che Kao and Scott C.-H. Huang
- Abstract summary: Knowledge distillation (KD) has been recognized as one of the effective methods of model compression for reducing model parameters.
Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK).
Our proposed RTK can effectively improve the classification accuracy of the state-of-the-art attention-based KD method.
- Score: 1.503974529275767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the improvement of AI chips (e.g., GPU, TPU, and NPU) and the rapid
development of the Internet of Things (IoT), powerful deep neural networks (DNNs)
often comprise millions or even hundreds of millions of parameters, which makes
them unsuitable for direct deployment on low-computation, low-capacity units
(e.g., edge devices). Recently, knowledge distillation (KD) has been recognized
as one of the effective methods of model compression for reducing the number of
model parameters. The main concept of KD is to extract useful information from
the feature maps of a large model (i.e., the teacher model) as a reference for
training a small model (i.e., the student model) whose size is much smaller than
the teacher's. Although many KD-based methods have been proposed to utilize the
information from the feature maps of intermediate layers in the teacher model,
most of them do not consider the similarity of feature maps between the teacher
and student models, which may cause the student model to learn useless
information. Inspired by the attention mechanism, we propose a novel KD method
called representative teacher key (RTK) that not only considers the similarity
of feature maps but also filters out useless information to improve the
performance of the target student model. In the experiments, we validate our
proposed method with several backbone networks (e.g., ResNet and WideResNet)
and datasets (e.g., CIFAR10, CIFAR100, SVHN, and CINIC10). The results show that
our proposed RTK can effectively improve the classification accuracy of the
state-of-the-art attention-based KD method.
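To make the distillation setup concrete, here is a minimal PyTorch sketch of attention-based knowledge distillation in the spirit described above: a soft-label distillation term (Hinton-style) plus an attention-transfer term computed from paired intermediate feature maps. This is an illustrative sketch only, not the authors' RTK implementation; the function names, the channel-summed attention map, the temperature T, and the weights alpha and beta are assumptions chosen for illustration, and RTK's representative-key selection and similarity-based filtering of useless information are not reproduced here.

```python
# Minimal attention-based KD sketch (PyTorch). Illustrative only; this is a
# generic attention-transfer loss, NOT the RTK method from the paper above.
import torch
import torch.nn.functional as F


def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Collapse an (N, C, H, W) feature map into an L2-normalized (N, H*W)
    spatial attention map by summing squared activations over channels."""
    att = feat.pow(2).sum(dim=1).flatten(1)   # (N, H*W)
    return F.normalize(att, p=2, dim=1)


def kd_loss(student_logits, teacher_logits, labels,
            student_feats, teacher_feats,
            T: float = 4.0, alpha: float = 0.9, beta: float = 1e3):
    """Soft-label KD + hard-label cross-entropy + attention transfer.
    Assumes student/teacher feature maps are paired and spatially aligned."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    att = sum(F.mse_loss(attention_map(s), attention_map(t))
              for s, t in zip(student_feats, teacher_feats))
    return alpha * soft + (1.0 - alpha) * hard + beta * att
```

In practice, the paired feature maps must share spatial resolution (or be interpolated to a common size) before their attention maps are compared, and the teacher's logits and feature maps are computed under torch.no_grad() so that only the student is updated.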
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Robust Knowledge Distillation Based on Feature Variance Against Backdoored Teacher Model [13.367731896112861]
Knowledge distillation (KD) is a widely used compression technique for edge deployment.
This paper proposes RobustKD, a robust KD method that compresses the model while mitigating backdoors based on feature variance.
arXiv Detail & Related papers (2024-06-01T11:25:03Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Lightweight Self-Knowledge Distillation with Multi-source Information Fusion [3.107478665474057]
Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models.
We propose a lightweight SKD framework that utilizes multi-source information to construct a more informative teacher.
We validate the performance of the proposed DRG, DSR, and their combination through comprehensive experiments on various datasets and models.
arXiv Detail & Related papers (2023-05-16T05:46:31Z)
- Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% FLOPs of the state-of-the-art method on both online and offline experiments.
arXiv Detail & Related papers (2022-11-21T03:09:42Z)
- CES-KD: Curriculum-based Expert Selection for Guided Knowledge Distillation [4.182345120164705]
This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD).
CES-KD is built upon the hypothesis that a student network should be guided gradually using a stratified teaching curriculum.
Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty in classifying the image.
arXiv Detail & Related papers (2022-09-15T21:02:57Z)
- Reducing Capacity Gap in Knowledge Distillation with Review Mechanism for Crowd Counting [16.65360204274379]
This paper introduces a novel review mechanism following KD models, motivated by the review mechanism humans use when studying.
The effectiveness of ReviewKD is demonstrated by a set of experiments over six benchmark datasets.
We also show that the suggested review mechanism can be used as a plug-and-play module to further boost the performance of heavy crowd counting models.
arXiv Detail & Related papers (2022-06-11T09:11:42Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Boosting Light-Weight Depth Estimation Via Knowledge Distillation [21.93879961636064]
We propose a lightweight network that can accurately estimate depth maps using minimal computing resources.
We achieve this by designing a compact model architecture that maximally reduces model complexity.
Our method achieves comparable performance to state-of-the-art methods while using only 1% of their parameters.
arXiv Detail & Related papers (2021-05-13T08:42:42Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)