Learn From the Past: Experience Ensemble Knowledge Distillation
- URL: http://arxiv.org/abs/2202.12488v1
- Date: Fri, 25 Feb 2022 04:05:09 GMT
- Title: Learn From the Past: Experience Ensemble Knowledge Distillation
- Authors: Chaofei Wang, Shaowei Zhang, Shiji Song, Gao Huang
- Abstract summary: We propose a novel knowledge distillation method that integrates the teacher's experience into knowledge transfer.
We save a moderate number of intermediate models, sampled uniformly from the teacher's training process, and then integrate the knowledge of these intermediate models with an ensemble technique.
A surprising finding is that strong ensemble teachers do not necessarily produce strong students.
- Score: 34.561007802532224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional knowledge distillation transfers "dark knowledge" of a
pre-trained teacher network to a student network, but ignores the knowledge in
the training process of the teacher, which we call the teacher's experience.
However, in realistic educational scenarios, learning experience is often more
important than learning results. In this work, we propose a novel knowledge
distillation method by integrating the teacher's experience for knowledge
transfer, named experience ensemble knowledge distillation (EEKD). We save a
moderate number of intermediate models, sampled uniformly from the training
process of the teacher model, and then integrate the knowledge of these
intermediate models with an ensemble technique. A self-attention module is used
to adaptively assign weights to the different intermediate models during
knowledge transfer. Three principles for constructing EEKD, concerning the
quality, weights, and number of intermediate models, are explored. A surprising
finding is that strong ensemble teachers do not necessarily produce strong
students. Experimental results on CIFAR-100 and ImageNet show that EEKD
outperforms mainstream knowledge distillation methods and achieves
state-of-the-art performance. In particular, EEKD even surpasses standard
ensemble distillation while saving training cost.
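A minimal sketch may help make the described mechanism concrete: a student is distilled against an attention-weighted ensemble of soft targets produced by intermediate teacher checkpoints. Everything below (the CheckpointAttention module, eekd_loss, the temperature T, the weight alpha, and the toy usage) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CheckpointAttention(nn.Module):
    """Scores each intermediate teacher's softened prediction and normalizes
    the scores over the K checkpoints, giving per-example ensemble weights."""

    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, teacher_probs: torch.Tensor) -> torch.Tensor:
        # teacher_probs: (K, batch, num_classes) -> weights: (K, batch, 1)
        scores = self.scorer(teacher_probs)
        return torch.softmax(scores, dim=0)  # normalize over the K checkpoints


def eekd_loss(student_logits, teacher_logits_list, labels, attention,
              T: float = 4.0, alpha: float = 0.9):
    """Distillation loss against an attention-weighted ensemble of
    intermediate teacher checkpoints, plus a standard cross-entropy term."""
    teacher_probs = torch.stack(
        [F.softmax(t.detach() / T, dim=-1) for t in teacher_logits_list], dim=0
    )                                                       # (K, batch, C)
    weights = attention(teacher_probs)                      # (K, batch, 1)
    ensemble_probs = (weights * teacher_probs).sum(dim=0)   # (batch, C)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  ensemble_probs, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


# Toy usage: random logits stand in for K = 4 checkpoints saved uniformly during
# teacher training; in practice each would come from a loaded intermediate model
# kept in evaluation mode.
K, batch, C = 4, 8, 100
attention = CheckpointAttention(num_classes=C)
student_logits = torch.randn(batch, C, requires_grad=True)
teacher_logits = [torch.randn(batch, C) for _ in range(K)]
labels = torch.randint(0, C, (batch,))
loss = eekd_loss(student_logits, teacher_logits, labels, attention)
loss.backward()
```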
Related papers
- AD-KD: Attribution-Driven Knowledge Distillation for Language Model
Compression [26.474962405945316]
We present a novel attribution-driven knowledge distillation approach to compress pre-trained language models.
To enhance the knowledge transfer of model reasoning and generalization, we explore multi-view attribution distillation on all potential decisions of the teacher.
arXiv Detail & Related papers (2023-05-17T07:40:12Z)
- Improved Knowledge Distillation for Pre-trained Language Models via
Knowledge Selection [35.515135913846386]
We propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation.
Experimental results on the GLUE datasets show that our method outperforms several strong knowledge distillation baselines significantly.
arXiv Detail & Related papers (2023-02-01T13:40:19Z)
- Revisiting Knowledge Distillation: An Inheritance and Exploration
Framework [153.73692961660964]
Knowledge Distillation (KD) is a popular technique to transfer knowledge from a teacher model to a student model.
We propose a novel inheritance and exploration knowledge distillation framework (IE-KD).
Our IE-KD framework is generic and can be easily combined with existing distillation or mutual learning methods for training deep neural networks.
arXiv Detail & Related papers (2021-07-01T02:20:56Z)
- Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) that feeds the rich information in the teacher's soft targets to the student incrementally and more efficiently (a rough sketch of this annealing idea appears after this list).
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
arXiv Detail & Related papers (2021-04-14T23:45:03Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model with the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
- Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z)
- Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn knowledge from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
arXiv Detail & Related papers (2021-01-21T07:17:04Z)
- Towards Understanding Ensemble, Knowledge Distillation and
Self-Distillation in Deep Learning [93.18238573921629]
We study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model.
We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory.
We prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
arXiv Detail & Related papers (2020-12-17T18:34:45Z)
- Learning From Multiple Experts: Self-paced Knowledge Distillation for
Long-tailed Classification [106.08067870620218]
We propose a self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME).
We refer to these models as 'Experts', and the proposed LFME framework aggregates the knowledge from multiple 'Experts' to learn a unified student model.
We conduct extensive experiments and demonstrate that our method achieves superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-01-06T12:57:36Z)
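Referring back to the Annealing Knowledge Distillation entry above, the sketch below illustrates one way to read "feeding the teacher's soft targets incrementally": the teacher's logits are scaled in gradually before the student switches to hard-label fine-tuning. The linear schedule, the MSE/cross-entropy split, and the anneal_epochs value are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def annealing_factor(epoch: int, anneal_epochs: int) -> float:
    """Assumed linear schedule: grows from ~0 to 1 over the annealing phase."""
    return min(1.0, (epoch + 1) / anneal_epochs)


def annealed_kd_loss(student_logits, teacher_logits, labels,
                     epoch: int, anneal_epochs: int = 50):
    phi = annealing_factor(epoch, anneal_epochs)
    if phi < 1.0:
        # Phase 1: regress the student toward a gradually sharpening
        # (scaled) version of the teacher's logits.
        return F.mse_loss(student_logits, phi * teacher_logits.detach())
    # Phase 2: after the full teacher signal has been shown, fine-tune on hard labels.
    return F.cross_entropy(student_logits, labels)


# Toy usage: one step at epoch 10 with random tensors standing in for real logits.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = annealed_kd_loss(student_logits, teacher_logits, labels, epoch=10)
loss.backward()
```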
This list is automatically generated from the titles and abstracts of the papers in this site.