Dynamic Rectification Knowledge Distillation
- URL: http://arxiv.org/abs/2201.11319v1
- Date: Thu, 27 Jan 2022 04:38:01 GMT
- Title: Dynamic Rectification Knowledge Distillation
- Authors: Fahad Rahman Amik, Ahnaf Ismat Tasin, Silvia Ahmed, M. M. Lutfe Elahi,
Nabeel Mohammed
- Abstract summary: Dynamic Rectification Knowledge Distillation (DR-KD) is a knowledge distillation framework.
DR-KD transforms the student into its own teacher, and if the self-teacher makes wrong predictions while distilling information, the error is rectified prior to the knowledge being distilled.
Our proposed DR-KD performs remarkably well in the absence of a sophisticated cumbersome teacher model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation is a technique which aims to utilize dark knowledge to
compress and transfer information from a vast, well-trained neural network
(teacher model) to a smaller, less capable neural network (student model) with
improved inference efficiency. Knowledge distillation has gained popularity because such
cumbersome models are prohibitively complex to deploy on edge computing devices. The
teacher models used to teach smaller student models are typically cumbersome and
expensive to train. To eliminate the need for a cumbersome teacher
model completely, we propose a simple yet effective knowledge distillation
framework that we termed Dynamic Rectification Knowledge Distillation (DR-KD).
Our method transforms the student into its own teacher, and if the self-teacher
makes wrong predictions while distilling information, the error is rectified
prior to the knowledge being distilled. Specifically, the teacher targets are
dynamically rectified using the ground-truth labels while distilling the
knowledge gained from traditional training. Our proposed DR-KD performs
remarkably well in the absence of a sophisticated cumbersome teacher model and
achieves performance comparable to existing state-of-the-art teacher-free
knowledge distillation frameworks when implemented with a low-cost, dynamically
rectified teacher. Our approach is general and can be applied to train any deep
neural network for classification or object recognition. DR-KD improves test
accuracy on Tiny ImageNet by 2.65% over prominent baseline models, which is
significantly better than other knowledge distillation approaches, while requiring
no additional training cost.
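
The abstract describes the rectification only at a high level, so the following is a minimal PyTorch sketch of how a dynamically rectified self-teacher could feed a standard distillation loss. The swap-based rectification, the temperature, and the weighting factor alpha are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch of DR-KD-style target rectification (assumptions, not the
# authors' released code): when the self-teacher misclassifies a sample, swap the
# soft-target probability of its wrongly predicted class with that of the
# ground-truth class, so the corrected class carries the peak before distilling.
import torch
import torch.nn.functional as F


def rectified_soft_targets(teacher_logits: torch.Tensor,
                           labels: torch.Tensor,
                           temperature: float = 4.0) -> torch.Tensor:
    """Soften the self-teacher's logits and rectify them where it is wrong."""
    probs = F.softmax(teacher_logits / temperature, dim=1)
    preds = probs.argmax(dim=1)
    wrong = (preds != labels).nonzero(as_tuple=True)[0]   # misclassified samples
    rect = probs.clone()
    # Swap the probability mass of the predicted class and the true class.
    rect[wrong, labels[wrong]] = probs[wrong, preds[wrong]]
    rect[wrong, preds[wrong]] = probs[wrong, labels[wrong]]
    return rect


def dr_kd_loss(student_logits, teacher_logits, labels,
               temperature: float = 4.0, alpha: float = 0.9):
    """Hard-label cross-entropy plus KL divergence to the rectified soft targets."""
    targets = rectified_soft_targets(teacher_logits.detach(), labels, temperature)
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_p, targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

In the self-distillation setting suggested by the abstract, teacher_logits would presumably come from the same network (e.g., its own earlier predictions) rather than a separate cumbersome teacher.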
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Student-friendly Knowledge Distillation [1.5469452301122173]
We propose student-friendly knowledge distillation (SKD) to simplify teacher output into new knowledge representations.
SKD consists of a softening process and a learning simplifier.
The experimental results on the CIFAR-100 and ImageNet datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-05-18T11:44:30Z) - Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, but significantly degrades the performance of any student model that tries to distill from it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z) - Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft targets incrementally and more efficiently (a rough sketch of this idea appears after this list).
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
arXiv Detail & Related papers (2021-04-14T23:45:03Z) - Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z) - DE-RRD: A Knowledge Distillation Framework for Recommender System [16.62204445256007]
We propose a knowledge distillation framework for recommender system, called DE-RRD.
It enables the student model to learn from the latent knowledge encoded in the teacher model as well as from the teacher's predictions.
Our experiments show that DE-RRD outperforms the state-of-the-art competitors and achieves comparable or even better performance to that of the teacher model with faster inference time.
arXiv Detail & Related papers (2020-12-08T11:09:22Z) - Ensemble Knowledge Distillation for CTR Prediction [46.92149090885551]
We propose a new model training strategy based on knowledge distillation (KD).
KD is a teacher-student learning framework to transfer knowledge learned from a teacher model to a student model.
We propose some novel techniques to facilitate ensembled CTR prediction, including teacher gating and early stopping by distillation loss.
arXiv Detail & Related papers (2020-11-08T23:37:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.