Knowledge Distillation of Transformer-based Language Models Revisited
- URL: http://arxiv.org/abs/2206.14366v2
- Date: Thu, 30 Jun 2022 08:04:06 GMT
- Title: Knowledge Distillation of Transformer-based Language Models Revisited
- Authors: Chengqiang Lu, Jianwei Zhang, Yunfei Chu, Zhengyu Chen, Jingren Zhou,
Fei Wu, Haiqing Chen, Hongxia Yang
- Abstract summary: Large model size and high run-time latency are serious impediments to applying pre-trained language models in practice.
We propose a unified knowledge distillation framework for transformer-based models.
Our empirical results shed light on distillation in pre-trained language models and show relatively significant improvements over the previous state of the art (SOTA).
- Score: 74.25427636413067
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the past few years, transformer-based pre-trained language models have
achieved astounding success in both industry and academia. However, the large
model size and high run-time latency are serious impediments to applying them
in practice, especially on mobile phones and Internet of Things (IoT) devices.
To compress the model, considerable literature has grown up around the theme of
knowledge distillation (KD) recently. Nevertheless, how KD works in
transformer-based models is still unclear. We tease apart the components of KD
and propose a unified KD framework. Through the framework, systematic and
extensive experiments that spent over 23,000 GPU hours render a comprehensive
analysis from the perspectives of knowledge types, matching strategies,
width-depth trade-off, initialization, model size, etc. Our empirical results
shed light on distillation in pre-trained language models and yield relatively
significant improvements over the previous state of the art (SOTA). Finally, we
provide a best-practice guideline for KD in transformer-based models.
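As a concrete illustration of the knowledge types and matching strategies the framework ablates, the sketch below combines logit-level (soft label) distillation with hidden-state matching in PyTorch-style code; the temperature, loss weights, and the assumption of equal hidden widths are illustrative choices rather than the paper's exact recipe.

```python
# Minimal sketch of two common knowledge types in transformer KD:
# logit (soft label) matching and hidden-state matching.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5):
    """Combine soft-label KL divergence with hidden-state MSE matching."""
    # Logit knowledge: KL between temperature-scaled output distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hidden-state knowledge: match one chosen student layer to one teacher layer
    # (uniform layer mapping and equal hidden widths are assumed in this sketch).
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)

    return alpha * soft_loss + (1.0 - alpha) * hidden_loss
```

When the student is narrower than the teacher, a learned linear projection on the student hidden states is the usual way to make the MSE term well defined; the width-depth trade-off studied in the paper interacts directly with that choice.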
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
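A hedged reading of the online-module idea summarized above, sketched in PyTorch: the teacher backbone stays frozen while a small trainable module placed on top of its hidden states is optimized jointly with the student; the module shape and loss split below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineModule(nn.Module):
    """Small trainable head attached to frozen teacher hidden states."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, teacher_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(teacher_hidden)

def okd_style_step(student, online_module, teacher_hidden, input_ids, labels, optimizer):
    """One joint update; `optimizer` covers both student and online-module parameters."""
    adapted_logits = online_module(teacher_hidden)   # trainable path over a frozen teacher
    student_logits = student(input_ids)
    # Fit the online module to the task while the student mimics its output
    # (this particular loss split is an illustrative assumption).
    module_loss = F.cross_entropy(
        adapted_logits.view(-1, adapted_logits.size(-1)), labels.view(-1))
    distill_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(adapted_logits.detach(), dim=-1),
        reduction="batchmean")
    loss = module_loss + distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```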
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models [0.0]
An increase in the number of connected devices around the world calls for compressed models that can be easily deployed on local devices with limited compute capacity and power.
We implemented both compression techniques, quantization and pruning, on popular deep learning models used in image classification, object detection, language modeling, and generative modeling tasks.
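For reference, the two compression routes named above can be sketched with standard PyTorch utilities; the toy model, 30% sparsity, and int8 precision below are arbitrary illustrative settings, not the study's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model standing in for a larger network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Magnitude (L1) pruning: zero out 30% of the smallest weights in each linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Post-training dynamic quantization: store linear-layer weights in int8.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)
```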
arXiv Detail & Related papers (2024-07-22T14:20:53Z)
- What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks.
We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of a smaller pretrained model to an initialization for a larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
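A minimal sketch of what a learned linear growth operator could look like, assuming a simple factorized map from a small weight matrix to a larger one; the factorization and initialization below are simplifying assumptions, not LiGO's exact parameterization.

```python
import torch
import torch.nn as nn

class LinearGrowth(nn.Module):
    """A learnable linear map from a small weight matrix to a larger one."""
    def __init__(self, d_small: int, d_large: int):
        super().__init__()
        # Factorized expansion: W_large = A @ W_small @ B^T.
        self.expand_out = nn.Parameter(torch.randn(d_large, d_small) * 0.02)
        self.expand_in = nn.Parameter(torch.randn(d_large, d_small) * 0.02)

    def forward(self, small_weight: torch.Tensor) -> torch.Tensor:
        return self.expand_out @ small_weight @ self.expand_in.t()

# Usage: expand a pretrained 256-dim layer into a 512-dim initialization; training
# the growth operator itself (as the paper does) is omitted from this sketch.
grow = LinearGrowth(d_small=256, d_large=512)
small_weight = torch.randn(256, 256)   # stands in for a pretrained weight matrix
large_init = grow(small_weight)        # shape (512, 512)
```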
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
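A hedged sketch of input gradient alignment: alongside the usual output matching, the student's gradient with respect to the input is pushed toward the teacher's; the loss form and weighting below are illustrative assumptions rather than the exact KDIGA objective.

```python
import torch
import torch.nn.functional as F

def kdiga_style_loss(student, teacher, x, labels, beta=1.0):
    """Task loss + KD loss + input-gradient alignment penalty (illustrative weights)."""
    x = x.clone().requires_grad_(True)

    # Teacher input gradient (no higher-order graph needed on the teacher side).
    teacher_logits = teacher(x)
    t_grad = torch.autograd.grad(F.cross_entropy(teacher_logits, labels), x)[0]

    # Student forward pass; keep the graph so the gradient penalty is trainable.
    student_logits = student(x)
    task_loss = F.cross_entropy(student_logits, labels)
    s_grad = torch.autograd.grad(task_loss, x, create_graph=True)[0]

    kd_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    align_loss = F.mse_loss(s_grad, t_grad.detach())
    return task_loss + kd_loss + beta * align_loss
```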
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- HYDRA -- Hyper Dependency Representation Attentions [4.697611383288171]
We propose lightweight pretrained linguistic self-attention heads to inject knowledge into transformer models without pretraining them again.
Our approach strikes a balance between letting the models learn in an unsupervised manner and rigidly forcing them to conform to linguistic knowledge.
We empirically verify our framework on benchmark datasets to show the contribution of linguistic knowledge to a transformer model.
arXiv Detail & Related papers (2021-09-11T19:17:34Z)
- Ensemble Knowledge Distillation for CTR Prediction [46.92149090885551]
We propose a new model training strategy based on knowledge distillation (KD).
KD is a teacher-student learning framework to transfer knowledge learned from a teacher model to a student model.
We propose some novel techniques to facilitate ensembled CTR prediction, including teacher gating and early stopping by distillation loss.
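A small sketch of how teacher gating could look for CTR distillation: a gate network produces per-example weights over an ensemble of teachers, and the student is distilled toward the gated mixture; the gate architecture and binary cross-entropy target below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherGate(nn.Module):
    """Produces per-example softmax weights over an ensemble of teachers."""
    def __init__(self, feature_dim: int, num_teachers: int):
        super().__init__()
        self.gate = nn.Linear(feature_dim, num_teachers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.gate(features), dim=-1)   # shape (batch, num_teachers)

def gated_distillation_loss(student_logit, teacher_logits, gate_weights):
    """teacher_logits: (batch, num_teachers) CTR logits; gate_weights: matching shape."""
    teacher_probs = torch.sigmoid(teacher_logits)
    # Per-example weighted average of the teachers' click probabilities.
    soft_target = (gate_weights * teacher_probs).sum(dim=-1)
    student_prob = torch.sigmoid(student_logit.squeeze(-1))
    return F.binary_cross_entropy(student_prob, soft_target.detach())
```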
arXiv Detail & Related papers (2020-11-08T23:37:58Z)
- Knowledge Distillation: A Survey [87.51063304509067]
Deep neural networks have been successful in both industry and academia, especially for computer vision tasks.
It is a challenge to deploy these cumbersome deep models on devices with limited resources.
Knowledge distillation effectively learns a small student model from a large teacher model.
arXiv Detail & Related papers (2020-06-09T21:47:17Z)
- Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks [39.2907363775529]
Knowledge distillation (KD) has been proposed to transfer information learned from one model to another.
This paper reviews KD and student-teacher (S-T) learning, which have been actively studied in recent years.
arXiv Detail & Related papers (2020-04-13T13:45:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.