ERNIE-Tiny : A Progressive Distillation Framework for Pretrained
Transformer Compression
- URL: http://arxiv.org/abs/2106.02241v1
- Date: Fri, 4 Jun 2021 04:00:16 GMT
- Title: ERNIE-Tiny : A Progressive Distillation Framework for Pretrained
Transformer Compression
- Authors: Weiyue Su, Xuyi Chen, Shikun Feng, Jiaxiang Liu, Weixin Liu, Yu Sun,
Hao Tian, Hua Wu, Haifeng Wang
- Abstract summary: We propose ERNIE-Tiny, a four-stage progressive distillation framework for compressing pretrained language models (PLMs).
Experiments show that a 4-layer ERNIE-Tiny maintains over 98.0% of the performance of its 12-layer teacher BERT base on the GLUE benchmark.
ERNIE-Tiny achieves a new compression SOTA on five Chinese NLP tasks, outperforming BERT base by 0.4% accuracy with 7.5x fewer parameters and 9.4x faster inference speed.
- Score: 20.23732233214849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models (PLMs) such as BERT adopt a training paradigm
that first pretrains the model on general data and then finetunes it on
task-specific data, and have recently achieved great success. However, PLMs are
notorious for their enormous number of parameters and are hard to deploy in
real-life applications. Knowledge distillation has become a prevailing approach
to this problem, transferring knowledge from a large teacher to a much smaller
student over a set of data. We argue that the selection of three key
components, namely the teacher, the training data, and the learning objective,
is crucial to the effectiveness of distillation. We therefore propose
ERNIE-Tiny, a four-stage progressive distillation framework that compresses
PLMs by gradually varying these three components from the general level to the
task-specific level. Specifically, the first stage, General Distillation,
performs distillation with guidance from a pretrained teacher, general data,
and a latent distillation loss. Then, General-Enhanced Distillation replaces
the pretrained teacher with a finetuned teacher. After that, Task-Adaptive
Distillation shifts the training data from general data to task-specific data.
Finally, Task-Specific Distillation adds two additional losses, namely the
Soft-Label and Hard-Label losses, on top of the previous stage. Empirical
results demonstrate the effectiveness of our framework and the generalization
gain brought by ERNIE-Tiny. In particular, experiments show that a 4-layer
ERNIE-Tiny maintains over 98.0% of the performance of its 12-layer teacher BERT
base on the GLUE benchmark, surpassing the state of the art (SOTA) by 1.0% GLUE
score with the same number of parameters. Moreover, ERNIE-Tiny achieves a new
compression SOTA on five Chinese NLP tasks, outperforming BERT base by 0.4%
accuracy with 7.5x fewer parameters and 9.4x faster inference speed.
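To make the staging concrete, the following is a minimal PyTorch-style sketch of one possible reading of the abstract: each stage fixes which teacher, which data source, and which losses are active, and a single training step combines the active losses. The stage table, the function names (latent_distill_loss, soft_label_loss), the hidden-state projection, and the equal loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the four-stage progressive schedule described in the
# abstract; stage contents, loss forms, and equal weighting are assumptions.
import torch
import torch.nn.functional as F

STAGES = [
    # (teacher,      data,             active losses)
    ("pretrained",   "general",        {"latent"}),                  # General Distillation
    ("finetuned",    "general",        {"latent"}),                  # General-Enhanced Distillation
    ("finetuned",    "task-specific",  {"latent"}),                  # Task-Adaptive Distillation
    ("finetuned",    "task-specific",  {"latent", "soft", "hard"}),  # Task-Specific Distillation
]

def latent_distill_loss(student_hidden, teacher_hidden, proj):
    # Match student hidden states to teacher hidden states via a learned projection.
    return F.mse_loss(proj(student_hidden), teacher_hidden)

def soft_label_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student predictions.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def distillation_step(student, teacher, inputs, labels, active, proj):
    """One step; `student`/`teacher` are callables returning (logits, hidden_states)."""
    with torch.no_grad():
        t_logits, t_hidden = teacher(inputs)
    s_logits, s_hidden = student(inputs)
    loss = s_logits.new_zeros(())
    if "latent" in active:
        loss = loss + latent_distill_loss(s_hidden, t_hidden, proj)
    if "soft" in active:
        loss = loss + soft_label_loss(s_logits, t_logits)
    if "hard" in active:
        loss = loss + F.cross_entropy(s_logits, labels)
    return loss
```

In this reading, the latent (hidden-state) distillation term is kept throughout, and only the teacher, the data source, and the label-based losses change as training progresses from the general to the task-specific level.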
Related papers
- Learning Lightweight Object Detectors via Multi-Teacher Progressive
Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- A Study on Knowledge Distillation from Weak Teacher for Scaling Up
Pre-trained Language Models [104.64899255277443]
Distillation from Weak Teacher (DWT) is a method of transferring knowledge from a smaller, weaker teacher model to a larger student model to improve its performance.
This study examines three key factors to optimize DWT, distinct from those used in the vision domain or traditional knowledge distillation.
arXiv Detail & Related papers (2023-05-26T13:24:49Z)
- Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing [59.58984194238254]
We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization.
Unlike prior works that rely on an extreme-scale teacher model, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs.
By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs.
arXiv Detail & Related papers (2023-05-26T05:19:24Z)
- How to Distill your BERT: An Empirical Study on the Impact of Weight
Initialisation and Distillation Objectives [18.192124201159594]
We show that attention transfer gives the best performance overall.
We also study the impact of layer choice when initializing the student from the teacher layers.
We release our code as an efficient transformer-based model distillation framework for further studies.
arXiv Detail & Related papers (2023-05-24T11:16:09Z)
- Remember the Past: Distilling Datasets into Addressable Memories for
Neural Networks [27.389093857615876]
We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories.
These memories can then be recalled to quickly re-train a neural network and recover the performance.
We demonstrate state-of-the-art results on the dataset distillation task across five benchmarks.
arXiv Detail & Related papers (2022-06-06T21:32:26Z)
- Conditional Generative Data-Free Knowledge Distillation based on
Attention Transfer [0.8594140167290099]
We propose a conditional generative data-free knowledge distillation (CGDD) framework to train an efficient portable network without any real data.
In this framework, besides the knowledge extracted from the teacher model, we introduce preset labels as additional auxiliary information.
We show that the portable network trained with the proposed data-free distillation method obtains 99.63%, 99.07%, and 99.84% relative accuracy on CIFAR10, CIFAR100, and Caltech101, respectively.
arXiv Detail & Related papers (2021-12-31T09:23:40Z)
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
- Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z)
- Extracurricular Learning: Knowledge Transfer Beyond Empirical
Distribution [17.996541285382463]
We propose extracurricular learning to bridge the gap between a compressed student model and its teacher.
We conduct rigorous evaluations on regression and classification tasks and show that compared to the standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.
This leads to major accuracy improvements compared to the empirical risk minimization-based training for various recent neural network architectures.
arXiv Detail & Related papers (2020-06-30T18:21:21Z)
- Towards Non-task-specific Distillation of BERT via Sentence
Representation Approximation [17.62309851473892]
We propose a sentence-representation-approximation oriented distillation framework that can distill pre-trained BERT into a simple LSTM-based model.
Our model is able to perform transfer learning via fine-tuning to adapt to any sentence-level downstream task.
The experimental results on multiple NLP tasks from the GLUE benchmark show that our approach outperforms other task-specific distillation methods.
arXiv Detail & Related papers (2020-04-07T03:03:00Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student (a minimal sketch of this idea appears after this list).
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
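As referenced in the MiniLM entry above, here is a minimal sketch of distilling only the teacher's last-layer self-attention distributions. It assumes both models expose final-layer attention probabilities with the same number of heads, and it omits MiniLM's value-relation term; it is an illustration of the idea, not the paper's exact objective.

```python
# Hypothetical sketch of last-layer self-attention distillation in the spirit of
# MiniLM; shapes, head-count matching, and the omitted value-relation term are
# simplifications.
import torch

def attention_kl(teacher_attn, student_attn, eps=1e-8):
    """KL(teacher || student) averaged over batch, heads, and query positions.

    Both inputs are attention probabilities from each model's *last* Transformer
    layer, shaped (batch, num_heads, seq_len, seq_len).
    """
    t = teacher_attn.clamp_min(eps)
    s = student_attn.clamp_min(eps)
    return (t * (t.log() - s.log())).sum(dim=-1).mean()

# Usage with random stand-in attention maps:
teacher_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
student_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
loss = attention_kl(teacher_attn, student_attn)
```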
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.