Improved Knowledge Distillation for Pre-trained Language Models via
Knowledge Selection
- URL: http://arxiv.org/abs/2302.00444v1
- Date: Wed, 1 Feb 2023 13:40:19 GMT
- Title: Improved Knowledge Distillation for Pre-trained Language Models via
Knowledge Selection
- Authors: Chenglong Wang, Yi Lu, Yongyu Mu, Yimin Hu, Tong Xiao and Jingbo Zhu
- Abstract summary: We propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation.
Experimental results on the GLUE datasets show that our method outperforms several strong knowledge distillation baselines significantly.
- Score: 35.515135913846386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation addresses the problem of transferring knowledge from a
teacher model to a student model. In this process, we typically have multiple
types of knowledge extracted from the teacher model. The problem is to make
full use of them to train the student model. Our preliminary study shows that:
(1) not all of the knowledge is necessary for learning a good student model,
and (2) knowledge distillation can benefit from certain knowledge at different
training steps. In response to these, we propose an actor-critic approach to
selecting appropriate knowledge to transfer during the process of knowledge
distillation. In addition, we offer a refinement of the training algorithm to
ease the computational burden. Experimental results on the GLUE datasets show
that our method outperforms several strong knowledge distillation baselines
significantly.
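The abstract outlines the method only at a high level; the following is a minimal PyTorch sketch of what an actor-critic knowledge selector of this kind could look like. The knowledge types, state features, reward signal, and all names are illustrative assumptions, not the authors' implementation.
```python
# Hedged sketch: an actor selects which distillation signals to apply at a given
# training step, and a critic scores the training state. Names, shapes, and the
# reward definition are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

KNOWLEDGE_TYPES = ["logits", "hidden", "attention"]  # assumed knowledge sources

class Actor(nn.Module):
    """Maps a training-state feature vector to per-knowledge selection probabilities."""
    def __init__(self, state_dim, n_knowledge):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_knowledge))
    def forward(self, state):
        return torch.sigmoid(self.net(state))

class Critic(nn.Module):
    """Estimates the value of the current training state."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
    def forward(self, state):
        return self.net(state).squeeze(-1)

def distillation_losses(teacher_out, student_out, T=2.0):
    """Per-knowledge-type distillation losses (KL on logits, MSE elsewhere)."""
    return {
        "logits": F.kl_div(F.log_softmax(student_out["logits"] / T, dim=-1),
                           F.softmax(teacher_out["logits"] / T, dim=-1),
                           reduction="batchmean") * T * T,
        "hidden": F.mse_loss(student_out["hidden"], teacher_out["hidden"]),
        "attention": F.mse_loss(student_out["attention"], teacher_out["attention"]),
    }

# One illustrative step with random tensors standing in for model outputs.
state = torch.randn(8)                      # e.g., step ratio, recent losses, ...
actor, critic = Actor(8, len(KNOWLEDGE_TYPES)), Critic(8)
probs = actor(state)
mask = torch.bernoulli(probs)               # sample which knowledge to transfer

teacher = {"logits": torch.randn(4, 10), "hidden": torch.randn(4, 16),
           "attention": torch.randn(4, 8)}
student = {k: v + 0.1 * torch.randn_like(v) for k, v in teacher.items()}

losses = distillation_losses(teacher, student)
kd_loss = sum(mask[i] * losses[k] for i, k in enumerate(KNOWLEDGE_TYPES))

# The critic's value estimate and an observed reward (e.g., a dev-loss drop)
# would drive the actor update via a policy gradient; shown schematically:
reward = torch.tensor(0.05)                 # placeholder reward
advantage = reward - critic(state)
log_prob = (mask * probs.clamp_min(1e-8).log()
            + (1 - mask) * (1 - probs).clamp_min(1e-8).log()).sum()
actor_loss = -(advantage.detach() * log_prob)
critic_loss = advantage.pow(2)
```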
Related papers
- AD-KD: Attribution-Driven Knowledge Distillation for Language Model
Compression [26.474962405945316]
We present a novel attribution-driven knowledge distillation approach to compress pre-trained language models.
To enhance the knowledge transfer of model reasoning and generalization, we explore multi-view attribution distillation on all potential decisions of the teacher.
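A minimal sketch of the general idea of attribution-based distillation follows, using gradient-times-input on stand-in embeddings and a single fixed decision rather than the paper's multi-view formulation; all shapes and names are assumptions for illustration.
```python
# Hedged sketch of attribution matching: compute input attributions for teacher
# and student, then penalize their mismatch. The attribution method and model
# shapes are assumptions, not the AD-KD implementation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb = torch.randn(4, 12, 32)                          # (batch, tokens, dim), stand-in embeddings
teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(12 * 32, 3))
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(12 * 32, 3))

def attribution(model, emb, target_class):
    """Gradient-times-input attribution per token, reduced over the embedding dim."""
    emb = emb.clone().detach().requires_grad_(True)
    score = model(emb)[:, target_class].sum()
    grad, = torch.autograd.grad(score, emb)
    return (grad * emb).sum(dim=-1)                   # (batch, tokens)

# A single class index is used here for brevity; AD-KD additionally considers
# multiple candidate decisions ("multi-view" attribution).
t_attr = attribution(teacher, emb, target_class=0)
s_attr = attribution(student, emb, target_class=0)
attr_loss = F.mse_loss(F.normalize(s_attr, dim=-1), F.normalize(t_attr, dim=-1))
```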
arXiv Detail & Related papers (2023-05-17T07:40:12Z) - Adaptively Integrated Knowledge Distillation and Prediction Uncertainty
for Continual Learning [71.43841235954453]
Current deep learning models often suffer from catastrophic forgetting of old knowledge when continually learning new knowledge.
Existing strategies to alleviate this issue often fix the trade-off between keeping old knowledge (stability) and learning new knowledge (plasticity).
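One possible reading of such an adaptive trade-off, sketched below, weights the old-knowledge distillation term by the old model's prediction confidence; the entropy-based rule is an assumption for illustration, not the paper's formulation.
```python
# Hedged sketch: adaptively trade off old-knowledge distillation against
# new-task learning using prediction uncertainty (illustrative assumption).
import torch
import torch.nn.functional as F

old_logits = torch.randn(4, 5)                        # frozen old model's outputs on new-task data
new_logits = torch.randn(4, 5, requires_grad=True)    # current model's outputs
labels = torch.randint(0, 5, (4,))

# Per-example uncertainty of the old model: normalized entropy in [0, 1].
p_old = F.softmax(old_logits, dim=-1)
entropy = -(p_old * p_old.clamp_min(1e-8).log()).sum(-1) / torch.log(torch.tensor(5.0))
stability_w = 1.0 - entropy                           # confident old predictions -> keep old knowledge

kd = F.kl_div(F.log_softmax(new_logits, -1), p_old, reduction="none").sum(-1)
ce = F.cross_entropy(new_logits, labels, reduction="none")
loss = (stability_w * kd + (1.0 - stability_w) * ce).mean()
```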
arXiv Detail & Related papers (2023-01-18T05:36:06Z) - Anti-Retroactive Interference for Lifelong Learning [65.50683752919089]
We design a paradigm for lifelong learning based on meta-learning and the associative mechanism of the brain.
It tackles the problem from two aspects: extracting knowledge and memorizing knowledge.
A theoretical analysis shows that the proposed learning paradigm can make the models of different tasks converge to the same optimum.
arXiv Detail & Related papers (2022-08-27T09:27:36Z) - Learn From the Past: Experience Ensemble Knowledge Distillation [34.561007802532224]
We propose a novel knowledge distillation method by integrating the teacher's experience for knowledge transfer.
We uniformly save a moderate number of intermediate models from the teacher's training process and then integrate their knowledge with an ensemble technique.
A surprising finding is that strong ensemble teachers do not necessarily produce strong students.
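A minimal sketch of the checkpoint-ensemble idea follows, with random tensors standing in for the saved checkpoints' logits; the averaging scheme and temperature are assumptions, not the paper's exact recipe.
```python
# Hedged sketch: distill from an ensemble of the teacher's intermediate
# checkpoints by averaging their softened predictions.
import torch
import torch.nn.functional as F

T = 2.0
# Stand-ins for the logits of k checkpoints saved during teacher training.
checkpoint_logits = [torch.randn(4, 10) for _ in range(5)]
student_logits = torch.randn(4, 10, requires_grad=True)

# Ensemble "experience": mean of the checkpoints' softened distributions.
ensemble_probs = torch.stack(
    [F.softmax(z / T, dim=-1) for z in checkpoint_logits]).mean(dim=0)

kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                   ensemble_probs, reduction="batchmean") * T * T
```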
arXiv Detail & Related papers (2022-02-25T04:05:09Z) - Extracting knowledge from features with multilevel abstraction [3.4443503349903124]
Knowledge distillation transfers knowledge from a large teacher model to a small student model; self-knowledge distillation (SKD) instead distills knowledge within a single model.
In this paper, we propose a novel SKD method that takes a different route from mainstream approaches.
Experiments and ablation studies show its strong effectiveness and generalization across various kinds of tasks.
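For orientation, the sketch below shows one common form of self-knowledge distillation (a shallow auxiliary head matching the final head's softened output); it is a generic illustration under assumed names and shapes, not the multilevel-abstraction method proposed in this paper.
```python
# Hedged sketch of generic SKD: a shallow auxiliary classifier inside the
# network mimics the final (deep) classifier's softened output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageNet(nn.Module):
    def __init__(self, d_in=32, d_hid=64, n_cls=10):
        super().__init__()
        self.shallow = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())
        self.deep = nn.Sequential(nn.Linear(d_hid, d_hid), nn.ReLU())
        self.aux_head = nn.Linear(d_hid, n_cls)    # early-exit head (the "student")
        self.head = nn.Linear(d_hid, n_cls)        # final head (the "teacher")

    def forward(self, x):
        h1 = self.shallow(x)
        h2 = self.deep(h1)
        return self.aux_head(h1), self.head(h2)

net, T = TwoStageNet(), 2.0
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
aux_logits, final_logits = net(x)
skd_loss = (F.cross_entropy(final_logits, y)
            + F.cross_entropy(aux_logits, y)
            + F.kl_div(F.log_softmax(aux_logits / T, dim=-1),
                       F.softmax(final_logits.detach() / T, dim=-1),
                       reduction="batchmean") * T * T)
```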
arXiv Detail & Related papers (2021-12-04T02:25:46Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most existing methods, which rely on effective training of student models given pretrained teachers, we aim to learn teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z) - Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn knowledge from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
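A minimal sketch of combining instance-level and relation-level knowledge between two peer students follows; the mutual-KL and similarity-matrix losses are a common formulation used here as assumptions, not the exact CTSL-MKT objective.
```python
# Hedged sketch: transfer both instance-level (predictions) and relation-level
# (pairwise similarities) knowledge between two peer students.
import torch
import torch.nn.functional as F

def relation_matrix(features):
    """Pairwise cosine-similarity matrix over the batch (instance relations)."""
    f = F.normalize(features, dim=-1)
    return f @ f.t()

feat_a, feat_b = torch.randn(8, 32), torch.randn(8, 32)      # peer students' features
logits_a, logits_b = torch.randn(8, 5), torch.randn(8, 5)    # peer students' predictions

# Instance knowledge: each student mimics the other's softened predictions.
inst = (F.kl_div(F.log_softmax(logits_a, -1), F.softmax(logits_b, -1).detach(),
                 reduction="batchmean")
        + F.kl_div(F.log_softmax(logits_b, -1), F.softmax(logits_a, -1).detach(),
                   reduction="batchmean"))

# Relation knowledge: align how each student relates the instances in the batch.
rel = F.mse_loss(relation_matrix(feat_a), relation_matrix(feat_b))

collab_loss = inst + rel
```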
arXiv Detail & Related papers (2021-01-21T07:17:04Z) - Towards a Universal Continuous Knowledge Base [49.95342223987143]
We propose a method for building a continuous knowledge base that can store knowledge imported from multiple neural networks.
We import the knowledge from multiple models into the knowledge base, from which the fused knowledge is exported back to a single model.
Experiments on text classification show promising results.
arXiv Detail & Related papers (2020-12-25T12:27:44Z) - A Selective Survey on Versatile Knowledge Distillation Paradigm for
Neural Network Models [3.770437296936382]
We review the characteristics of knowledge distillation under the hypothesis that its three important ingredients are the distilled knowledge and loss, the teacher-student paradigm, and the distillation process.
We also discuss future directions in knowledge distillation, including explainable knowledge distillation, where the performance gain is analyzed, and self-supervised learning, a hot research topic in the deep learning community.
arXiv Detail & Related papers (2020-11-30T05:22:02Z) - Introspective Learning by Distilling Knowledge from Online
Self-explanation [36.91213895208838]
We propose an implementation of introspective learning by distilling knowledge from online self-explanations.
The models trained with the introspective learning procedure outperform the ones trained with the standard learning procedure.
arXiv Detail & Related papers (2020-09-19T02:05:32Z)