Knowledge Distillation of Transformer-based Language Models Revisited
- URL: http://arxiv.org/abs/2206.14366v2
- Date: Thu, 30 Jun 2022 08:04:06 GMT
- Title: Knowledge Distillation of Transformer-based Language Models Revisited
- Authors: Chengqiang Lu, Jianwei Zhang, Yunfei Chu, Zhengyu Chen, Jingren Zhou,
Fei Wu, Haiqing Chen, Hongxia Yang
- Abstract summary: Large model size and high run-time latency are serious impediments to applying pre-trained language models in practice.
We propose a unified knowledge distillation framework for transformer-based models.
Our empirical results shed light on distillation in pre-trained language models and show relatively significant improvements over the previous state of the art (SOTA).
- Score: 74.25427636413067
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the past few years, transformer-based pre-trained language models have
achieved astounding success in both industry and academia. However, the large
model size and high run-time latency are serious impediments to applying them
in practice, especially on mobile phones and Internet of Things (IoT) devices.
To compress the model, considerable literature has grown up around the theme of
knowledge distillation (KD) recently. Nevertheless, how KD works in
transformer-based models is still unclear. We tease apart the components of KD
and propose a unified KD framework. Through the framework, systematic and
extensive experiments that spent over 23,000 GPU hours render a comprehensive
analysis from the perspectives of knowledge types, matching strategies,
width-depth trade-off, initialization, model size, etc. Our empirical results
shed light on distillation in pre-trained language models and yield relatively
significant improvements over the previous state of the art (SOTA). Finally, we
provide a best-practice guideline for KD in transformer-based models.
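As a concrete illustration of the knowledge types and matching strategies the framework ablates, the sketch below combines logit-level (soft label) distillation with hidden-state matching in PyTorch-style code; the temperature, loss weights, and the assumption of equal hidden widths are illustrative choices rather than the paper's exact recipe.

```python
# Minimal sketch of two common knowledge types in transformer KD:
# logit (soft label) matching and hidden-state matching.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5):
    """Combine soft-label KL divergence with hidden-state MSE matching."""
    # Logit knowledge: KL between temperature-scaled output distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hidden-state knowledge: match one chosen student layer to one teacher layer
    # (uniform layer mapping and equal hidden widths are assumed in this sketch).
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)

    return alpha * soft_loss + (1.0 - alpha) * hidden_loss
```

When the student is narrower than the teacher, a learned linear projection on the student hidden states is the usual way to make the MSE term well defined; the width-depth trade-off studied in the paper interacts directly with that choice.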
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
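A hedged reading of the online-module idea summarized above, sketched in PyTorch: the teacher backbone stays frozen while a small trainable module placed on top of its hidden states is optimized jointly with the student; the module shape and loss split below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineModule(nn.Module):
    """Small trainable head attached to frozen teacher hidden states."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, teacher_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(teacher_hidden)

def okd_style_step(student, online_module, teacher_hidden, input_ids, labels, optimizer):
    """One joint update; `optimizer` covers both student and online-module parameters."""
    adapted_logits = online_module(teacher_hidden)   # trainable path over a frozen teacher
    student_logits = student(input_ids)
    # Fit the online module to the task while the student mimics its output
    # (this particular loss split is an illustrative assumption).
    module_loss = F.cross_entropy(
        adapted_logits.view(-1, adapted_logits.size(-1)), labels.view(-1))
    distill_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(adapted_logits.detach(), dim=-1),
        reduction="batchmean")
    loss = module_loss + distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```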
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models [0.0]
An increase in the number of connected devices around the world calls for compressed models that can be easily deployed on local devices with limited compute capacity and power.
We implemented both compression techniques, quantization and pruning, on popular deep learning models used in image classification, object detection, language modeling, and generative modeling tasks.
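For reference, the two compression routes named above can be sketched with standard PyTorch utilities; the toy model, 30% sparsity, and int8 precision below are arbitrary illustrative settings, not the study's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model standing in for a larger network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Magnitude (L1) pruning: zero out 30% of the smallest weights in each linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Post-training dynamic quantization: store linear-layer weights in int8.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)
```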
arXiv Detail & Related papers (2024-07-22T14:20:53Z)
- What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks.
We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of a smaller pretrained model to an initialization for a larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
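A minimal sketch of what a learned linear growth operator could look like, assuming a simple factorized map from a small weight matrix to a larger one; the factorization and initialization below are simplifying assumptions, not LiGO's exact parameterization.

```python
import torch
import torch.nn as nn

class LinearGrowth(nn.Module):
    """A learnable linear map from a small weight matrix to a larger one."""
    def __init__(self, d_small: int, d_large: int):
        super().__init__()
        # Factorized expansion: W_large = A @ W_small @ B^T.
        self.expand_out = nn.Parameter(torch.randn(d_large, d_small) * 0.02)
        self.expand_in = nn.Parameter(torch.randn(d_large, d_small) * 0.02)

    def forward(self, small_weight: torch.Tensor) -> torch.Tensor:
        return self.expand_out @ small_weight @ self.expand_in.t()

# Usage: expand a pretrained 256-dim layer into a 512-dim initialization; training
# the growth operator itself (as the paper does) is omitted from this sketch.
grow = LinearGrowth(d_small=256, d_large=512)
small_weight = torch.randn(256, 256)   # stands in for a pretrained weight matrix
large_init = grow(small_weight)        # shape (512, 512)
```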
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
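A hedged sketch of input gradient alignment: alongside the usual output matching, the student's gradient with respect to the input is pushed toward the teacher's; the loss form and weighting below are illustrative assumptions rather than the exact KDIGA objective.

```python
import torch
import torch.nn.functional as F

def kdiga_style_loss(student, teacher, x, labels, beta=1.0):
    """Task loss + KD loss + input-gradient alignment penalty (illustrative weights)."""
    x = x.clone().requires_grad_(True)

    # Teacher input gradient (no higher-order graph needed on the teacher side).
    teacher_logits = teacher(x)
    t_grad = torch.autograd.grad(F.cross_entropy(teacher_logits, labels), x)[0]

    # Student forward pass; keep the graph so the gradient penalty is trainable.
    student_logits = student(x)
    task_loss = F.cross_entropy(student_logits, labels)
    s_grad = torch.autograd.grad(task_loss, x, create_graph=True)[0]

    kd_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    align_loss = F.mse_loss(s_grad, t_grad.detach())
    return task_loss + kd_loss + beta * align_loss
```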
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- HYDRA -- Hyper Dependency Representation Attentions [4.697611383288171]
We propose lightweight pretrained linguistic self-attention heads to inject knowledge into transformer models without pretraining them again.
Our approach strikes a balance between letting the models learn in an unsupervised manner and rigidly forcing them to conform to linguistic knowledge.
We empirically verify our framework on benchmark datasets to show the contribution of linguistic knowledge to a transformer model.
arXiv Detail & Related papers (2021-09-11T19:17:34Z)
- Ensemble Knowledge Distillation for CTR Prediction [46.92149090885551]
We propose a new model training strategy based on knowledge distillation (KD).
KD is a teacher-student learning framework to transfer knowledge learned from a teacher model to a student model.
We propose some novel techniques to facilitate ensembled CTR prediction, including teacher gating and early stopping by distillation loss.
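A small sketch of how teacher gating could look for CTR distillation: a gate network produces per-example weights over an ensemble of teachers, and the student is distilled toward the gated mixture; the gate architecture and binary cross-entropy target below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherGate(nn.Module):
    """Produces per-example softmax weights over an ensemble of teachers."""
    def __init__(self, feature_dim: int, num_teachers: int):
        super().__init__()
        self.gate = nn.Linear(feature_dim, num_teachers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.gate(features), dim=-1)   # shape (batch, num_teachers)

def gated_distillation_loss(student_logit, teacher_logits, gate_weights):
    """teacher_logits: (batch, num_teachers) CTR logits; gate_weights: matching shape."""
    teacher_probs = torch.sigmoid(teacher_logits)
    # Per-example weighted average of the teachers' click probabilities.
    soft_target = (gate_weights * teacher_probs).sum(dim=-1)
    student_prob = torch.sigmoid(student_logit.squeeze(-1))
    return F.binary_cross_entropy(student_prob, soft_target.detach())
```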
arXiv Detail & Related papers (2020-11-08T23:37:58Z)
- Knowledge Distillation: A Survey [87.51063304509067]
Deep neural networks have been successful in both industry and academia, especially for computer vision tasks.
It is a challenge to deploy these cumbersome deep models on devices with limited resources.
Knowledge distillation effectively learns a small student model from a large teacher model.
arXiv Detail & Related papers (2020-06-09T21:47:17Z)
- Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks [39.2907363775529]
Knowledge distillation (KD) has been proposed to transfer information learned from one model to another.
This paper reviews KD and student-teacher (S-T) learning, which have been actively studied in recent years.
arXiv Detail & Related papers (2020-04-13T13:45:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.