ERNIE-Tiny : A Progressive Distillation Framework for Pretrained
Transformer Compression
- URL: http://arxiv.org/abs/2106.02241v1
- Date: Fri, 4 Jun 2021 04:00:16 GMT
- Title: ERNIE-Tiny : A Progressive Distillation Framework for Pretrained
Transformer Compression
- Authors: Weiyue Su, Xuyi Chen, Shikun Feng, Jiaxiang Liu, Weixin Liu, Yu Sun,
Hao Tian, Hua Wu, Haifeng Wang
- Abstract summary: We propose ERNIE-Tiny, a four-stage progressive distillation framework for compressing pretrained language models (PLMs).
Experiments show that a 4-layer ERNIE-Tiny maintains over 98.0% of the performance of its 12-layer teacher BERT base on the GLUE benchmark.
ERNIE-Tiny achieves a new compression SOTA on five Chinese NLP tasks, outperforming BERT base by 0.4% accuracy with 7.5x fewer parameters and 9.4x faster inference speed.
- Score: 20.23732233214849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models (PLMs) such as BERT adopt a training paradigm
that first pretrains the model on general data and then finetunes it on
task-specific data, and have recently achieved great success. However, PLMs are
notorious for their enormous number of parameters and are hard to deploy in
real-life applications. Knowledge distillation has become a prevailing approach
to this problem, transferring knowledge from a large teacher to a much smaller
student over a set of data. We argue that the selection of three key
components, namely the teacher, the training data, and the learning objective,
is crucial to the effectiveness of distillation. We therefore propose
ERNIE-Tiny, a four-stage progressive distillation framework that compresses
PLMs by gradually varying these three components from the general level to the
task-specific level. Specifically, the first stage, General Distillation,
performs distillation with guidance from a pretrained teacher, general data,
and a latent distillation loss. Then, General-Enhanced Distillation replaces
the pretrained teacher with a finetuned teacher. After that, Task-Adaptive
Distillation shifts the training data from general data to task-specific data.
Finally, Task-Specific Distillation adds two additional losses, namely the
Soft-Label and Hard-Label losses, on top of the previous stage. Empirical
results demonstrate the effectiveness of our framework and the generalization
gain brought by ERNIE-Tiny. In particular, experiments show that a 4-layer
ERNIE-Tiny maintains over 98.0% of the performance of its 12-layer teacher BERT
base on the GLUE benchmark, surpassing the state of the art (SOTA) by 1.0% GLUE
score with the same number of parameters. Moreover, ERNIE-Tiny achieves a new
compression SOTA on five Chinese NLP tasks, outperforming BERT base by 0.4%
accuracy with 7.5x fewer parameters and 9.4x faster inference speed.
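To make the staging concrete, the following is a minimal PyTorch-style sketch of one possible reading of the abstract: each stage fixes which teacher, which data source, and which losses are active, and a single training step combines the active losses. The stage table, the function names (latent_distill_loss, soft_label_loss), the hidden-state projection, and the equal loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the four-stage progressive schedule described in the
# abstract; stage contents, loss forms, and equal weighting are assumptions.
import torch
import torch.nn.functional as F

STAGES = [
    # (teacher,      data,             active losses)
    ("pretrained",   "general",        {"latent"}),                  # General Distillation
    ("finetuned",    "general",        {"latent"}),                  # General-Enhanced Distillation
    ("finetuned",    "task-specific",  {"latent"}),                  # Task-Adaptive Distillation
    ("finetuned",    "task-specific",  {"latent", "soft", "hard"}),  # Task-Specific Distillation
]

def latent_distill_loss(student_hidden, teacher_hidden, proj):
    # Match student hidden states to teacher hidden states via a learned projection.
    return F.mse_loss(proj(student_hidden), teacher_hidden)

def soft_label_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student predictions.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def distillation_step(student, teacher, inputs, labels, active, proj):
    """One step; `student`/`teacher` are callables returning (logits, hidden_states)."""
    with torch.no_grad():
        t_logits, t_hidden = teacher(inputs)
    s_logits, s_hidden = student(inputs)
    loss = s_logits.new_zeros(())
    if "latent" in active:
        loss = loss + latent_distill_loss(s_hidden, t_hidden, proj)
    if "soft" in active:
        loss = loss + soft_label_loss(s_logits, t_logits)
    if "hard" in active:
        loss = loss + F.cross_entropy(s_logits, labels)
    return loss
```

In this reading, the latent (hidden-state) distillation term is kept throughout, and only the teacher, the data source, and the label-based losses change as training progresses from the general to the task-specific level.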
Related papers
- Learning Lightweight Object Detectors via Multi-Teacher Progressive
Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- A Study on Knowledge Distillation from Weak Teacher for Scaling Up
Pre-trained Language Models [104.64899255277443]
Distillation from Weak Teacher (DWT) is a method of transferring knowledge from a smaller, weaker teacher model to a larger student model to improve its performance.
This study examines three key factors to optimize DWT, distinct from those used in the vision domain or traditional knowledge distillation.
arXiv Detail & Related papers (2023-05-26T13:24:49Z)
- Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing [59.58984194238254]
We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization.
Unlike prior works that rely on an extreme-scale teacher model, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs.
By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs.
arXiv Detail & Related papers (2023-05-26T05:19:24Z)
- How to Distill your BERT: An Empirical Study on the Impact of Weight
Initialisation and Distillation Objectives [18.192124201159594]
We show that attention transfer gives the best performance overall.
We also study the impact of layer choice when initializing the student from the teacher layers.
We release our code as an efficient transformer-based model distillation framework for further studies.
arXiv Detail & Related papers (2023-05-24T11:16:09Z)
- Remember the Past: Distilling Datasets into Addressable Memories for
Neural Networks [27.389093857615876]
We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories.
These memories can then be recalled to quickly re-train a neural network and recover the performance.
We demonstrate state-of-the-art results on the dataset distillation task across five benchmarks.
arXiv Detail & Related papers (2022-06-06T21:32:26Z)
- Conditional Generative Data-Free Knowledge Distillation based on
Attention Transfer [0.8594140167290099]
We propose a conditional generative data-free knowledge distillation (CGDD) framework to train an efficient portable network without any real data.
In this framework, besides the knowledge extracted from the teacher model, we introduce preset labels as additional auxiliary information.
We show that the portable network trained with the proposed data-free distillation method obtains 99.63%, 99.07%, and 99.84% relative accuracy on CIFAR10, CIFAR100, and Caltech101, respectively.
arXiv Detail & Related papers (2021-12-31T09:23:40Z)
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
- Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z)
- Extracurricular Learning: Knowledge Transfer Beyond Empirical
Distribution [17.996541285382463]
We propose extracurricular learning to bridge the gap between a compressed student model and its teacher.
We conduct rigorous evaluations on regression and classification tasks and show that compared to the standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.
This leads to major accuracy improvements compared to the empirical risk minimization-based training for various recent neural network architectures.
arXiv Detail & Related papers (2020-06-30T18:21:21Z)
- Towards Non-task-specific Distillation of BERT via Sentence
Representation Approximation [17.62309851473892]
We propose a sentence-representation-approximation oriented distillation framework that can distill pre-trained BERT into a simple LSTM-based model.
Our model is able to perform transfer learning via fine-tuning to adapt to any sentence-level downstream task.
The experimental results on multiple NLP tasks from the GLUE benchmark show that our approach outperforms other task-specific distillation methods.
arXiv Detail & Related papers (2020-04-07T03:03:00Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student (a minimal sketch of this idea appears after this list).
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
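As referenced in the MiniLM entry above, here is a minimal sketch of distilling only the teacher's last-layer self-attention distributions. It assumes both models expose final-layer attention probabilities with the same number of heads, and it omits MiniLM's value-relation term; it is an illustration of the idea, not the paper's exact objective.

```python
# Hypothetical sketch of last-layer self-attention distillation in the spirit of
# MiniLM; shapes, head-count matching, and the omitted value-relation term are
# simplifications.
import torch

def attention_kl(teacher_attn, student_attn, eps=1e-8):
    """KL(teacher || student) averaged over batch, heads, and query positions.

    Both inputs are attention probabilities from each model's *last* Transformer
    layer, shaped (batch, num_heads, seq_len, seq_len).
    """
    t = teacher_attn.clamp_min(eps)
    s = student_attn.clamp_min(eps)
    return (t * (t.log() - s.log())).sum(dim=-1).mean()

# Usage with random stand-in attention maps:
teacher_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
student_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
loss = attention_kl(teacher_attn, student_attn)
```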
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.