Learning Interpretation with Explainable Knowledge Distillation
- URL: http://arxiv.org/abs/2111.06945v1
- Date: Fri, 12 Nov 2021 21:18:06 GMT
- Title: Learning Interpretation with Explainable Knowledge Distillation
- Authors: Raed Alharbi, Minh N. Vu, My T. Thai
- Abstract summary: Knowledge Distillation (KD) has been considered a key solution for model compression and acceleration in recent years.
We propose a novel explainable knowledge distillation model, called XDistillation, through which both the performance and the explanation information are transferred from the teacher model to the student model.
Our experiments show that models trained by XDistillation outperform those trained by conventional KD methods in terms of both predictive accuracy and faithfulness to the teacher models.
- Score: 28.00216413365036
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Knowledge Distillation (KD) has been considered a key solution in model
compression and acceleration in recent years. In KD, a small student model is
generally trained from a large teacher model by minimizing the divergence
between the probabilistic outputs of the two. However, as demonstrated in our
experiments, existing KD methods might not transfer critical explainable
knowledge of the teacher to the student, i.e. the explanations of predictions
made by the two models are not consistent. In this paper, we propose a novel
explainable knowledge distillation model, called XDistillation, through which
both the performance and the explanation information are transferred from the
teacher model to the student model. The XDistillation model leverages the idea
of convolutional autoencoders to approximate the teacher explanations. Our
experiments show that models trained by XDistillation outperform those trained
by conventional KD methods not only in terms of predictive accuracy but also in
faithfulness to the teacher models.
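The abstract's objective can be pictured with a short sketch. The PyTorch code below only illustrates the idea described above: a convolutional autoencoder over teacher explanation maps combined with a standard soft-label KD term. The layer sizes, the use of saliency maps as the explanation signal, and the loss weights are assumptions made for the example, not the paper's actual XDistillation architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplanationAutoencoder(nn.Module):
    """Convolutional autoencoder over teacher explanation (saliency) maps.
    Hypothetical layer sizes; the paper only states that XDistillation
    'leverages the idea of convolutional autoencoders'."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1),
        )

    def forward(self, saliency):
        return self.decoder(self.encoder(saliency))


def xdistillation_loss(student_logits, teacher_logits, labels,
                       student_saliency, teacher_saliency,
                       autoencoder, T=4.0, alpha=0.5, beta=0.5):
    """Task loss + soft-label KD + explanation matching (illustrative weights).
    student_saliency / teacher_saliency are assumed to be (N, 1, H, W)
    attribution maps, e.g. Grad-CAM."""
    task = F.cross_entropy(student_logits, labels)
    # Standard KD term: match softened teacher and student distributions.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    # Explanation term: pull the student's saliency toward the
    # autoencoder's approximation of the teacher's saliency.
    approx_teacher_expl = autoencoder(teacher_saliency)
    expl = F.mse_loss(student_saliency, approx_teacher_expl)
    return task + alpha * kd + beta * expl
```

How the autoencoder itself is trained (pre-fit to the teacher's explanations or learned jointly) is left open here; the point of the sketch is only that explanation information enters the distillation objective explicitly.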
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution (an illustrative sketch of this interleaving follows the related-papers list below).
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - Improving Knowledge Distillation with Teacher's Explanation [14.935696904019146]
We introduce a novel Knowledge Explaining Distillation (KED) framework.
KED allows the student to learn not only from the teacher's predictions but also from the teacher's explanations.
Our experiments over a variety of datasets show that KED students can substantially outperform KD students of similar complexity.
arXiv Detail & Related papers (2023-10-04T04:18:01Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression [26.474962405945316]
We present a novel attribution-driven knowledge distillation approach to compress pre-trained language models.
To enhance the knowledge transfer of model reasoning and generalization, we explore multi-view attribution distillation on all potential decisions of the teacher.
arXiv Detail & Related papers (2023-05-17T07:40:12Z) - Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - DE-RRD: A Knowledge Distillation Framework for Recommender System [16.62204445256007]
We propose a knowledge distillation framework for recommender systems, called DE-RRD.
It enables the student model to learn from the latent knowledge encoded in the teacher model as well as from the teacher's predictions.
Our experiments show that DE-RRD outperforms state-of-the-art competitors and achieves performance comparable to or even better than that of the teacher model, with faster inference time.
arXiv Detail & Related papers (2020-12-08T11:09:22Z) - Joint learning of interpretation and distillation [7.412850488684037]
This paper conducts an empirical study on a new approach to explaining each prediction of GBDT2NN.
Experiments on several benchmarks show that the proposed methods achieve better performance on both explanations and predictions.
arXiv Detail & Related papers (2020-05-24T02:01:22Z)
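The interleaved sampling described in the Speculative Knowledge Distillation entry above can also be sketched briefly. The code below is an assumption-laden illustration rather than the SKD authors' algorithm: the function name, the top-k acceptance rule, and the HF-style `.logits` interface are hypothetical choices made for the example.

```python
import torch

def speculative_kd_step(student, teacher, prompt_ids, max_new_tokens=32, top_k=25):
    """One hypothetical round of student-proposes / teacher-filters sampling.
    Both models are assumed to be HF-style causal LMs returning .logits."""
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        with torch.no_grad():
            s_logits = student(ids).logits[:, -1, :]   # student's next-token scores
            t_logits = teacher(ids).logits[:, -1, :]   # teacher's next-token scores
        proposal = s_logits.argmax(dim=-1)             # student proposes a token
        topk = t_logits.topk(top_k, dim=-1).indices    # teacher's highly ranked tokens
        accepted = (topk == proposal.unsqueeze(-1)).any(dim=-1)
        # Poorly ranked proposals are replaced by a sample from the teacher.
        teacher_sample = torch.multinomial(t_logits.softmax(dim=-1), 1).squeeze(-1)
        next_token = torch.where(accepted, proposal, teacher_sample)
        ids = torch.cat([ids, next_token.unsqueeze(-1)], dim=-1)
    return ids  # mixed sequence usable as on-the-fly distillation data
```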