Knowledge Distillation with Representative Teacher Keys Based on
Attention Mechanism for Image Classification Model Compression
- URL: http://arxiv.org/abs/2206.12788v1
- Date: Sun, 26 Jun 2022 05:08:50 GMT
- Title: Knowledge Distillation with Representative Teacher Keys Based on
Attention Mechanism for Image Classification Model Compression
- Authors: Jun-Teng Yang, Sheng-Che Kao and Scott C.-H. Huang
- Abstract summary: Knowledge distillation (KD) has been recognized as one of the effective methods of model compression for reducing model parameters.
Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK).
Our proposed RTK effectively improves classification accuracy over the state-of-the-art attention-based KD method.
- Score: 1.503974529275767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the improvement of AI chips (e.g., GPU, TPU, and NPU) and the fast
development of the Internet of Things (IoT), powerful deep neural networks (DNNs)
are usually composed of millions or even hundreds of millions of parameters, which
may make them unsuitable for direct deployment on low-computation, low-capacity
units (e.g., edge devices). Recently, knowledge distillation (KD) has been
recognized as one of the effective methods of model compression for reducing model
parameters. The main concept of KD is to extract useful information from the
feature maps of a large model (i.e., the teacher model) as a reference for
successfully training a small model (i.e., the student model) whose size is much
smaller than the teacher's. Although many KD-based methods have been proposed to
utilize the information from the feature maps of intermediate layers in the
teacher model, most of them do not consider the similarity of feature maps between
the teacher model and the student model, which may lead the student model to learn
useless information. Inspired by the attention mechanism, we propose a novel KD
method called representative teacher key (RTK) that not only considers the
similarity of feature maps but also filters out useless information to improve the
performance of the target student model. In the experiments, we validate our
proposed method with several backbone networks (e.g., ResNet and WideResNet) and
datasets (e.g., CIFAR10, CIFAR100, SVHN, and CINIC10). The results show that our
proposed RTK can effectively improve the classification accuracy over the
state-of-the-art attention-based KD method.
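The abstract does not spell out how the representative teacher keys are selected, but the attention-based feature-map distillation it builds on can be sketched. Below is a minimal, hypothetical PyTorch sketch of an attention-transfer style KD loss; the function names (attention_map, attention_kd_loss, total_loss) and the weighting hyperparameter beta are illustrative assumptions, not the paper's exact RTK formulation.

```python
# Hypothetical sketch of attention-based feature-map distillation
# (attention-transfer style KD), NOT the paper's exact RTK method.
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Collapse a feature map (N, C, H, W) into a normalized spatial
    attention map (N, H*W) by summing squared activations over channels."""
    att = feat.pow(2).sum(dim=1).flatten(1)   # (N, H*W)
    return F.normalize(att, p=2, dim=1)       # unit L2 norm per sample

def attention_kd_loss(student_feats, teacher_feats):
    """Average squared distance between paired student and teacher
    attention maps, computed layer by layer."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        loss = loss + (attention_map(fs) - attention_map(ft)).pow(2).mean()
    return loss / len(student_feats)

def total_loss(logits, labels, student_feats, teacher_feats, beta=1000.0):
    """Cross-entropy on hard labels plus a weighted distillation term;
    beta is an assumed, tunable hyperparameter."""
    return F.cross_entropy(logits, labels) + beta * attention_kd_loss(
        student_feats, teacher_feats)
```

In a full training loop, the teacher's intermediate feature maps would typically be computed under torch.no_grad() and only the student's parameters updated.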
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Invariant Causal Knowledge Distillation in Neural Networks [6.24302896438145]
In this paper, we introduce Invariant Consistency Distillation (ICD), a novel methodology designed to enhance knowledge distillation.
ICD ensures that the student model's representations are both discriminative and invariant with respect to the teacher's outputs.
Our results on CIFAR-100 and ImageNet ILSVRC-2012 show that ICD outperforms traditional KD techniques and surpasses state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T14:53:35Z) - Robust Knowledge Distillation Based on Feature Variance Against Backdoored Teacher Model [13.367731896112861]
Knowledge distillation (KD) is one of the widely used compression techniques for edge deployment.
This paper proposes RobustKD, a robust KD method that compresses the model while mitigating backdoors based on feature variance.
arXiv Detail & Related papers (2024-06-01T11:25:03Z) - Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - Lightweight Self-Knowledge Distillation with Multi-source Information
Fusion [3.107478665474057]
Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models.
We propose a lightweight SKD framework that utilizes multi-source information to construct a more informative teacher.
We validate the performance of the proposed DRG, DSR, and their combination through comprehensive experiments on various datasets and models.
arXiv Detail & Related papers (2023-05-16T05:46:31Z) - Directed Acyclic Graph Factorization Machines for CTR Prediction via
Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% FLOPs of the state-of-the-art method on both online and offline experiments.
arXiv Detail & Related papers (2022-11-21T03:09:42Z) - CES-KD: Curriculum-based Expert Selection for Guided Knowledge
Distillation [4.182345120164705]
This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD).
CES-KD is built upon the hypothesis that a student network should be guided gradually using a stratified teaching curriculum.
Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty in classifying the image.
arXiv Detail & Related papers (2022-09-15T21:02:57Z) - Reducing Capacity Gap in Knowledge Distillation with Review Mechanism
for Crowd Counting [16.65360204274379]
This paper introduces a novel review mechanism following KD models, motivated by the review process humans use when studying.
The effectiveness of ReviewKD is demonstrated by a set of experiments over six benchmark datasets.
We also show that the suggested review mechanism can be used as a plug-and-play module to further boost the performance of heavy crowd counting models.
arXiv Detail & Related papers (2022-06-11T09:11:42Z) - How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy (a rough sketch of this idea is given after the list below).
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z) - Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
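As referenced in the KDIGA entry above, the following is a minimal, hypothetical PyTorch sketch of knowledge distillation with an input-gradient alignment penalty. The function name and the hyperparameters temperature, alpha, and gamma are illustrative assumptions; the published KDIGA objective may differ in detail.

```python
# Hypothetical sketch of KD with an input-gradient alignment term,
# loosely following the KDIGA idea summarized above.
import torch
import torch.nn.functional as F

def kd_with_input_gradient_alignment(student, teacher, x, y,
                                     temperature=4.0, alpha=0.5, gamma=1.0):
    x = x.clone().requires_grad_(True)

    # Teacher pass: gradient of its loss w.r.t. the input (no higher-order graph).
    t_logits = teacher(x)
    t_grad = torch.autograd.grad(F.cross_entropy(t_logits, y), x)[0].detach()

    # Student pass: keep the graph so the alignment term stays differentiable.
    s_logits = student(x)
    s_loss = F.cross_entropy(s_logits, y)
    s_grad = torch.autograd.grad(s_loss, x, create_graph=True)[0]

    # Standard soft-label KD term (Hinton-style) on temperature-scaled logits.
    kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                  F.softmax(t_logits.detach() / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2

    # Penalize mismatch between student and teacher input gradients.
    align = (s_grad - t_grad).pow(2).mean()

    return alpha * s_loss + (1 - alpha) * kd + gamma * align
```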