Reinforced Multi-Teacher Selection for Knowledge Distillation
- URL: http://arxiv.org/abs/2012.06048v2
- Date: Mon, 14 Dec 2020 02:48:07 GMT
- Title: Reinforced Multi-Teacher Selection for Knowledge Distillation
- Authors: Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin
Jiang
- Abstract summary: Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout the whole distillation process.
Most of the existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
- Score: 54.72886763796232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In natural language processing (NLP) tasks, slow inference speed and huge
footprints in GPU usage remain the bottleneck of applying pre-trained deep
models in production. As a popular method for model compression, knowledge
distillation transfers knowledge from one or multiple large (teacher) models to
a small (student) model. When multiple teacher models are available in
distillation, state-of-the-art methods assign a fixed weight to each teacher
model throughout the whole distillation process. Furthermore, most of the existing methods
allocate an equal weight to every teacher model. In this paper, we observe
that, due to the complexity of training examples and the differences in student
model capability, learning differentially from teacher models can lead to
better performance of the distilled student models. We systematically develop a
reinforced method to dynamically assign weights to teacher models for different
training instances and optimize the performance of the student model. Our extensive
experimental results on several NLP tasks clearly verify the feasibility and
effectiveness of our approach.
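To make the idea concrete, below is a minimal PyTorch-style sketch of per-instance teacher selection trained with REINFORCE. All names (e.g. `TeacherWeightPolicy`, `distillation_step`), the sampling scheme, and the reward proxy are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: a policy network scores the available teachers for each
# training instance, one teacher is sampled per instance for the distillation
# loss, and the policy is updated with REINFORCE. Names, the reward proxy, and
# hyperparameters are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherWeightPolicy(nn.Module):
    """Maps an instance representation to a distribution over teachers."""
    def __init__(self, hidden_size: int, num_teachers: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, num_teachers)

    def forward(self, instance_repr: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.scorer(instance_repr), dim=-1)   # [batch, num_teachers]

def distillation_step(student, teachers, policy, batch,
                      student_opt, policy_opt, T: float = 2.0):
    features, labels = batch                      # features: [batch, hidden]
    student_logits = student(features)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)

    with torch.no_grad():
        teacher_logits = [t(features) for t in teachers]

    # Per-instance, per-teacher soft-label KD loss.
    kd_per_teacher = torch.stack([
        F.kl_div(log_p_student, F.softmax(t_log / T, dim=-1),
                 reduction="none").sum(-1)
        for t_log in teacher_logits
    ], dim=1)                                     # [batch, num_teachers]

    # Sample one teacher per training instance from the learned policy.
    probs = policy(features)
    dist = torch.distributions.Categorical(probs=probs)
    chosen = dist.sample()                        # [batch]
    kd_loss = kd_per_teacher.gather(1, chosen.unsqueeze(1)).squeeze(1).mean() * T * T
    student_loss = kd_loss + F.cross_entropy(student_logits, labels)

    student_opt.zero_grad()
    student_loss.backward()
    student_opt.step()

    # REINFORCE: reward choices that leave the student fitting the batch well.
    with torch.no_grad():
        reward = -F.cross_entropy(student(features), labels, reduction="none")  # [batch]
    policy_loss = -(reward * dist.log_prob(chosen)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
    return student_loss.item()
```

The reward above is only a per-batch stand-in for student performance; any signal tied to the student's task quality could be substituted.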
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion [29.297959023968165]
This paper proposes a progressive distillation method based on masked generation features for the KGC task.
Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models.
The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-01-19T07:34:36Z)
- Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
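As an illustration of the ensemble idea in the paper above, here is a small sketch in which a student speech encoder regresses the averaged hidden states of several frozen teachers; the L1 objective and the simple averaging are assumptions for illustration, not necessarily the paper's exact setup.

```python
# Illustrative sketch of ensemble knowledge distillation for speech encoders:
# the student regresses the averaged hidden states of several frozen teachers.
# The L1 loss and uniform averaging are common choices, assumed here for brevity.
import torch
import torch.nn.functional as F

def ensemble_feature_distill_loss(student_hidden, teacher_hiddens):
    """student_hidden: [batch, time, dim]; teacher_hiddens: list of [batch, time, dim]."""
    with torch.no_grad():
        target = torch.stack(teacher_hiddens, dim=0).mean(dim=0)  # ensemble average
    return F.l1_loss(student_hidden, target)
```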
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
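For context on the Mixture-of-Experts structure mentioned above, the following is a generic top-1 routed feed-forward MoE layer. It only illustrates the structure; it does not reproduce MoEBERT's importance-guided adaptation of a pre-trained FFN or its routing strategy.

```python
# Generic top-1 routed mixture-of-experts feed-forward layer (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, hidden_size: int, expert_size: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, expert_size), nn.GELU(),
                          nn.Linear(expert_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, hidden]; route each token to its highest-scoring expert.
        scores = F.softmax(self.router(x), dim=-1)          # [batch, seq, num_experts]
        top_score, top_idx = scores.max(dim=-1)             # [batch, seq]
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(x[mask]) * top_score[mask].unsqueeze(-1)
        return out
```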
- Self-Feature Regularization: Self-Feature Distillation Without Teacher Models [0.0]
Self-Feature Regularization (SFR) is proposed, which uses features in the deep layers to supervise feature learning in the shallow layers.
We first use a generalization-l2 loss to match local features and a many-to-one approach to distill more intensively in the channel dimension.
arXiv Detail & Related papers (2021-03-12T15:29:00Z)
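A rough sketch of the deep-to-shallow supervision described above: a shallow feature map is projected and pulled toward a detached deeper feature map with an L2 (MSE) loss. The 1x1 projection and the spatial pooling are illustrative choices, not the paper's exact design.

```python
# Hedged sketch of self-feature regularization: shallow features are projected
# and matched to (detached) deep features of the same network with an L2 loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowToDeepHead(nn.Module):
    def __init__(self, shallow_channels: int, deep_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(shallow_channels, deep_channels, kernel_size=1)

    def forward(self, shallow_feat, deep_feat):
        # Align channels and spatial size, then match with MSE against the
        # deeper feature map, which is treated as a fixed target.
        projected = self.proj(shallow_feat)
        projected = F.adaptive_avg_pool2d(projected, deep_feat.shape[-2:])
        return F.mse_loss(projected, deep_feat.detach())
```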
- Learning to Reweight with Deep Interactions [104.68509759134878]
We propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model.
Experiments on image classification with clean/noisy labels and neural machine translation empirically demonstrate that our algorithm makes significant improvements over previous methods.
arXiv Detail & Related papers (2020-07-09T09:06:31Z)
- Triplet Loss for Knowledge Distillation [2.683996597055128]
The purpose of knowledge distillation is to increase the similarity between the teacher model and the student model.
In metric learning, researchers develop methods to build models that increase the similarity of outputs for similar samples.
We think that metric learning can clarify the difference between different outputs, so the performance of the student model could be improved.
arXiv Detail & Related papers (2020-04-17T08:48:29Z)
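To illustrate how a triplet objective can be combined with standard soft-label distillation, here is a hedged sketch; the anchor/positive/negative pairing (student vs. teacher representations, negatives drawn from other samples in the batch) and the loss weighting are assumptions rather than the paper's exact scheme.

```python
# Hedged sketch: the student's representation of an input (anchor) is pulled
# toward the teacher's representation of the same input (positive) and pushed
# away from the teacher's representation of another input (negative), optionally
# combined with the usual softened-logit KD loss.
import torch
import torch.nn.functional as F

def kd_triplet_loss(student_repr, teacher_repr, margin=1.0, T=4.0,
                    student_logits=None, teacher_logits=None, alpha=0.5):
    # Negatives: teacher representations of other samples (batch rolled by one).
    negative = teacher_repr.roll(shifts=1, dims=0)
    triplet = F.triplet_margin_loss(student_repr, teacher_repr.detach(),
                                    negative.detach(), margin=margin)
    loss = triplet
    if student_logits is not None and teacher_logits is not None:
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * (T * T)
        loss = alpha * kd + (1 - alpha) * triplet
    return loss
```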
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer-based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
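A minimal sketch of the last-layer self-attention distillation described above: the student's attention distributions are matched to the teacher's with KL divergence. MiniLM also transfers value-value relations, which this sketch omits, and the head alignment here is an assumption.

```python
# Sketch of self-attention distillation in the spirit of MiniLM: match the
# student's last-layer attention distributions to the teacher's via KL divergence.
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn, eps=1e-8):
    """Both tensors: [batch, heads, seq, seq] attention probabilities of the
    last Transformer layer (student heads assumed aligned with teacher heads)."""
    return F.kl_div((student_attn + eps).log(), teacher_attn,
                    reduction="batchmean")
```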
This list is automatically generated from the titles and abstracts of the papers on this site.