How to Distill your BERT: An Empirical Study on the Impact of Weight
Initialisation and Distillation Objectives
- URL: http://arxiv.org/abs/2305.15032v1
- Date: Wed, 24 May 2023 11:16:09 GMT
- Title: How to Distill your BERT: An Empirical Study on the Impact of Weight
Initialisation and Distillation Objectives
- Authors: Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank
- Abstract summary: We show that attention transfer gives the best performance overall.
We also study the impact of layer choice when initializing the student from the teacher layers.
We release our code as an efficient transformer-based model distillation framework for further studies.
- Score: 18.192124201159594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, various intermediate layer distillation (ILD) objectives have been
shown to improve compression of BERT models via Knowledge Distillation (KD).
However, a comprehensive evaluation of the objectives in both task-specific and
task-agnostic settings is lacking. To the best of our knowledge, this is the
first work comprehensively evaluating distillation objectives in both settings.
We show that attention transfer gives the best performance overall. We also
study the impact of layer choice when initializing the student from the teacher
layers, finding a significant impact on the performance in task-specific
distillation. For vanilla KD and hidden states transfer, initialisation with
lower layers of the teacher gives a considerable improvement over higher
layers, especially on the task of QNLI (up to an absolute percentage change of
17.8 in accuracy). Attention transfer behaves consistently under different
initialisation settings. We release our code as an efficient transformer-based
model distillation framework for further studies.
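For concreteness, the sketch below illustrates the three families of objectives compared in the paper (vanilla KD on output logits, hidden states transfer, and attention transfer) and the initialisation of a shallower student from the teacher's lower layers. It is a minimal PyTorch sketch: the function names, loss forms, uniform layer mapping, and the HuggingFace-style parameter-name assumption are illustrative choices, not the released framework's implementation.
```python
# Illustrative sketch only: function names, loss weighting, and layer mapping
# are assumptions for exposition, not the paper's released code.
import torch
import torch.nn.functional as F


def vanilla_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Vanilla KD: KL divergence between temperature-softened output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def hidden_states_loss(student_hidden, teacher_hidden):
    """Hidden states transfer: MSE between paired student/teacher layer outputs."""
    return sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden))


def attention_transfer_loss(student_attn, teacher_attn):
    """Attention transfer: MSE between paired student/teacher self-attention maps."""
    return sum(F.mse_loss(s, t) for s, t in zip(student_attn, teacher_attn))


def init_student_from_lower_layers(student, teacher):
    """Copy every teacher parameter whose name and shape match the student's.

    With HuggingFace-style parameter names (encoder.layer.0, encoder.layer.1, ...),
    a 6-layer student then receives the embeddings and the *lower* six encoder
    layers of a 12-layer teacher, i.e. the initialisation the paper finds most
    helpful for vanilla KD and hidden states transfer.
    """
    teacher_params = teacher.state_dict()
    student_params = student.state_dict()
    for name, value in student_params.items():
        if name in teacher_params and teacher_params[name].shape == value.shape:
            student_params[name] = teacher_params[name].clone()
    student.load_state_dict(student_params)
    return student
```
Initialising from higher teacher layers (e.g. layers 6-11 of a 12-layer teacher) would require an explicit remapping of layer names before copying; the same-name copy above only reproduces the lower-layer setting.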
Related papers
- Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z) - Class-aware Information for Logit-based Knowledge Distillation [16.634819319915923]
We propose a Class-aware Logit Knowledge Distillation (CLKD) method that extends logit distillation to both the instance level and the class level.
CLKD enables the student model to mimic higher semantic information from the teacher model, hence improving the distillation performance.
arXiv Detail & Related papers (2022-11-27T09:27:50Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model on the training data and then transfer its knowledge to supervise the learning of a compact student model (a generic sketch of this paradigm is given after this list).
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Localization Distillation for Object Detection [134.12664548771534]
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the classification logits.
We present a novel localization distillation (LD) method which can efficiently transfer the localization knowledge from the teacher to the student.
We show that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a critical reason why logit mimicking has underperformed for years.
arXiv Detail & Related papers (2022-04-12T17:14:34Z) - Adaptive Instance Distillation for Object Detection in Autonomous
Driving [3.236217153362305]
We propose Adaptive Instance Distillation (AID), which selectively imparts the teacher's knowledge to the student to improve the performance of knowledge distillation.
Our AID is also shown to be useful for self-distillation to improve the teacher model's performance.
arXiv Detail & Related papers (2022-01-26T18:06:33Z) - Knowledge Distillation for Object Detection via Rank Mimicking and
Prediction-guided Feature Imitation [34.441349114336994]
We propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI) for distilling one-stage detectors.
RM takes the rank of candidate boxes from teachers as a new form of knowledge to distill.
PFI attempts to correlate feature differences with prediction differences, making feature imitation directly help to improve the student's accuracy.
arXiv Detail & Related papers (2021-12-09T11:19:15Z) - Response-based Distillation for Incremental Object Detection [2.337183337110597]
Traditional object detection methods are ill-equipped for incremental learning.
Fine-tuning a well-trained detection model directly on only new data leads to catastrophic forgetting.
We propose a fully response-based incremental distillation method focusing on learning response from detection bounding boxes and classification predictions.
arXiv Detail & Related papers (2021-10-26T08:07:55Z) - ERNIE-Tiny : A Progressive Distillation Framework for Pretrained
Transformer Compression [20.23732233214849]
We propose a four-stage progressive distillation framework, ERNIE-Tiny, to compress pretrained language models (PLMs).
Experiments show that a 4-layer ERNIE-Tiny maintains over 98.0% of the performance of its 12-layer teacher BERT base on the GLUE benchmark.
ERNIE-Tiny achieves a new compression SOTA on five Chinese NLP tasks, outperforming BERT base by 0.4% accuracy with 7.5x fewer parameters and 9.4x faster inference.
arXiv Detail & Related papers (2021-06-04T04:00:16Z) - Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the effect of cross-level connection paths between teacher and student networks and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our final nested and compact framework requires negligible overhead and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.