MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers
- URL: http://arxiv.org/abs/2002.10957v2
- Date: Mon, 6 Apr 2020 02:53:18 GMT
- Title: MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers
- Authors: Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
- Abstract summary: We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models.
- Score: 117.67424061746247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its
variants) have achieved remarkable success in a variety of NLP tasks. However,
these models usually consist of hundreds of millions of parameters, which brings
challenges for fine-tuning and online serving in real-life applications due to
latency and capacity constraints. In this work, we present a simple and
effective approach to compress large Transformer (Vaswani et al., 2017) based
pre-trained models, termed as deep self-attention distillation. The small model
(student) is trained by deeply mimicking the self-attention module, which plays
a vital role in Transformer networks, of the large model (teacher).
Specifically, we propose distilling the self-attention module of the last
Transformer layer of the teacher, which is effective and flexible for the
student. Furthermore, we introduce the scaled dot-product between values in the
self-attention module as the new deep self-attention knowledge, in addition to
the attention distributions (i.e., the scaled dot-product of queries and keys)
that have been used in existing works. Moreover, we show that introducing a
teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large
pre-trained Transformer models. Experimental results demonstrate that our
monolingual model outperforms state-of-the-art baselines across different
parameter sizes of student models. In particular, it retains more than 99% accuracy on
SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer
parameters and computations of the teacher model. We also obtain competitive
results in applying deep self-attention distillation to multilingual
pre-trained models.
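
To make the objective in the abstract concrete, below is a minimal PyTorch sketch of the two distillation terms it describes: the KL divergence between teacher and student attention distributions (scaled dot-product of queries and keys) and between their value relations (scaled dot-product of values), both taken from the last Transformer layer. All function names, tensor shapes, and the epsilon smoothing are illustrative assumptions rather than the authors' released code; the sketch also assumes teacher and student use the same number of attention heads so the relation matrices have matching shapes.

import torch
import torch.nn.functional as F

def relation_distributions(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim] from the last Transformer layer.
    # Returns the attention distributions softmax(QK^T / sqrt(d_k)) and the
    # value relations softmax(VV^T / sqrt(d_k)), both of shape
    # [batch, heads, seq_len, seq_len].
    d_k = q.size(-1)
    attn = F.softmax(q @ k.transpose(-1, -2) / d_k ** 0.5, dim=-1)
    val_rel = F.softmax(v @ v.transpose(-1, -2) / d_k ** 0.5, dim=-1)
    return attn, val_rel

def kl(teacher, student, eps=1e-9):
    # KL(teacher || student): summed over the key dimension, averaged over
    # batch, heads, and query positions.
    return (teacher * (torch.log(teacher + eps) - torch.log(student + eps))).sum(-1).mean()

def deep_self_attention_distillation_loss(teacher_qkv, student_qkv):
    # teacher_qkv / student_qkv: (q, k, v) tuples from the teacher's and the
    # student's last self-attention layers; head counts must match so that the
    # seq_len x seq_len relation matrices are directly comparable.
    t_attn, t_val = relation_distributions(*teacher_qkv)
    s_attn, s_val = relation_distributions(*student_qkv)
    return kl(t_attn, s_attn) + kl(t_val, s_val)

Because both relation matrices are seq_len x seq_len per head, nothing in this sketch ties the student's hidden size to the teacher's, which is the flexibility for the student that the abstract refers to.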
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion [29.297959023968165]
This paper proposes a progressive distillation method based on masked generation features for the KGC task.
Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models.
The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-01-19T07:34:36Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Generic-to-Specific Distillation of Masked Autoencoders [119.21281960831651]
We propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders.
With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher for image classification, object detection, and semantic segmentation, respectively.
arXiv Detail & Related papers (2023-02-28T17:13:14Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - MiniLMv2: Multi-Head Self-Attention Relation Distillation for
Compressing Pretrained Transformers [46.42728702637682]
We generalize deep self-attention distillation in MiniLM by only using self-attention relation distillation for task-agnostic compression of pretrained Transformers.
In particular, we define multi-head self-attention relations as scaled dot-product between the pairs of query, key, and value vectors.
Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT and RoBERTa) outperform the state of the art (a minimal sketch of this relation distillation appears after this list).
arXiv Detail & Related papers (2020-12-31T18:51:26Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the distillation process, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
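
As a companion to the MiniLMv2 entry above, here is a minimal sketch of multi-head self-attention relation distillation under stated assumptions: queries, keys, and values are each re-split into a chosen number of relation heads, their pairwise scaled dot-products are softmax-normalized, and the student is trained to match the teacher's relations via KL divergence. The function names, the num_relation_heads parameter, and the epsilon smoothing are illustrative assumptions, not the paper's released code.

import torch
import torch.nn.functional as F

def self_attention_relation(x, num_relation_heads):
    # x: [batch, seq_len, hidden] query, key, or value vectors from the
    # distilled layer. The hidden dimension is re-split into relation heads so
    # teacher and student relations share the same
    # [batch, heads, seq_len, seq_len] shape even when their hidden sizes
    # differ (hidden must be divisible by num_relation_heads).
    b, l, h = x.shape
    d_r = h // num_relation_heads
    x = x.view(b, l, num_relation_heads, d_r).transpose(1, 2)
    return F.softmax(x @ x.transpose(-1, -2) / d_r ** 0.5, dim=-1)

def relation_distillation_loss(teacher_qkv, student_qkv, num_relation_heads, eps=1e-9):
    # Sum of KL(teacher || student) over the query-query, key-key, and
    # value-value relations.
    loss = 0.0
    for t, s in zip(teacher_qkv, student_qkv):
        t_rel = self_attention_relation(t, num_relation_heads)
        s_rel = self_attention_relation(s, num_relation_heads)
        loss = loss + (t_rel * (torch.log(t_rel + eps)
                                - torch.log(s_rel + eps))).sum(-1).mean()
    return loss

Re-splitting into relation heads is what decouples the student's attention-head count from the teacher's in this sketch, in contrast to the per-head matching assumed in the first sketch above.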