MiniLMv2: Multi-Head Self-Attention Relation Distillation for
Compressing Pretrained Transformers
- URL: http://arxiv.org/abs/2012.15828v1
- Date: Thu, 31 Dec 2020 18:51:26 GMT
- Title: MiniLMv2: Multi-Head Self-Attention Relation Distillation for
Compressing Pretrained Transformers
- Authors: Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, Furu Wei
- Abstract summary: We generalize deep self-attention distillation in MiniLM by using only self-attention relation distillation for task-agnostic compression of pretrained Transformers.
In particular, we define multi-head self-attention relations as the scaled dot-products between pairs of query, key, and value vectors.
Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT and RoBERTa) outperform the state of the art.
- Score: 46.42728702637682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We generalize deep self-attention distillation in MiniLM (Wang et al., 2020)
by using only self-attention relation distillation for task-agnostic
compression of pretrained Transformers. In particular, we define multi-head
self-attention relations as the scaled dot-products between pairs of query,
key, and value vectors within each self-attention module. We then employ this
relational knowledge to train the student model. Besides its simplicity and
unified principle, the approach places no restriction on the number of the
student's attention heads, whereas most previous work requires the teacher and
student to have the same number of heads. Moreover, the fine-grained
self-attention relations tend to fully exploit the interaction knowledge
learned by the Transformer. In addition, we thoroughly examine the layer
selection strategy for teacher models, rather than relying only on the last
layer as in MiniLM. Experimental results demonstrate that our models distilled
from base-size and large-size teachers (BERT and RoBERTa) outperform the state
of the art.
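
As a rough illustration of the objective above, the sketch below computes query-query, key-key, and value-value relations for one teacher layer and one student layer and matches them with a KL-divergence loss, written in PyTorch. The helper names, tensor shapes, and the default of 48 relation heads are illustrative assumptions, not the authors' reference implementation.

import torch
import torch.nn.functional as F

def split_relation_heads(x: torch.Tensor, num_rel_heads: int) -> torch.Tensor:
    # Reshape [batch, seq, hidden] into [batch, num_rel_heads, seq, head_dim];
    # re-splitting both models into the same number of relation heads is what
    # removes the requirement that teacher and student share a head count.
    b, s, h = x.shape
    return x.view(b, s, num_rel_heads, h // num_rel_heads).transpose(1, 2)

def relation_scores(vectors: torch.Tensor) -> torch.Tensor:
    # Scaled dot-product between pairs of vectors of the same type
    # (query-query, key-key, or value-value) across positions.
    d = vectors.size(-1)
    return torch.matmul(vectors, vectors.transpose(-1, -2)) / d ** 0.5

def relation_distillation_loss(teacher_qkv, student_qkv, num_rel_heads=48):
    # teacher_qkv / student_qkv: (queries, keys, values) from one chosen layer,
    # each of shape [batch, seq, hidden]; the two hidden sizes may differ.
    loss = 0.0
    for t_vec, s_vec in zip(teacher_qkv, student_qkv):
        t_rel = F.softmax(relation_scores(split_relation_heads(t_vec, num_rel_heads)), dim=-1)
        s_logp = F.log_softmax(relation_scores(split_relation_heads(s_vec, num_rel_heads)), dim=-1)
        loss = loss + F.kl_div(s_logp, t_rel, reduction="batchmean")
    return loss / len(teacher_qkv)

Per the abstract, the teacher layer providing the relations is chosen by a layer-selection study rather than fixed to the last layer; the loop above simply averages the Q-Q, K-K, and V-V terms.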
Related papers
- Progressive distillation induces an implicit curriculum [44.528775476168654]
A better teacher does not always yield a better student; a common mitigation is to use additional supervision from several teachers.
One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher.
Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student's learning.
arXiv Detail & Related papers (2024-10-07T19:49:24Z)
- Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario.
arXiv Detail & Related papers (2024-04-14T18:57:38Z)
- Generic-to-Specific Distillation of Masked Autoencoders [119.21281960831651]
We propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders.
With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of its teacher's performance on image classification, object detection, and semantic segmentation, respectively.
arXiv Detail & Related papers (2023-02-28T17:13:14Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE.
RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems [99.13795374152997]
We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
The proposed MHS model consists of two components: a stack of transformer layers that encodes inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages individual models in ensemble as teachers in a way that preserves the diversity of the ensemble members.
arXiv Detail & Related papers (2022-01-15T06:21:01Z)
- SiT: Self-supervised vIsion Transformer [23.265568744478333]
In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice.
We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model.
We show that a pretrained SiT can be finetuned for a downstream classification task on small scale datasets.
arXiv Detail & Related papers (2021-04-08T08:34:04Z)
- Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the expense of high computational costs and are hard to deploy on low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
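
For contrast with the MiniLM entry above, here is a minimal sketch in the spirit of its last-layer objective, again under assumed PyTorch shapes and function names: the student mimics the teacher's attention distributions (queries against keys) and value relations, which ties the two models to the same number of attention heads and motivates the relation-based generalization in MiniLMv2.

import torch
import torch.nn.functional as F

def _kl(teacher_scores: torch.Tensor, student_scores: torch.Tensor) -> torch.Tensor:
    # KL divergence between softmax-normalized score matrices.
    return F.kl_div(F.log_softmax(student_scores, dim=-1),
                    F.softmax(teacher_scores, dim=-1),
                    reduction="batchmean")

def minilm_last_layer_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    # Per-head projections from the last layer of teacher and student,
    # each of shape [batch, heads, seq, head_dim]; the head counts must match,
    # which is exactly the restriction MiniLMv2 removes.
    d_t, d_s = t_q.size(-1), s_q.size(-1)
    attention_loss = _kl(t_q @ t_k.transpose(-1, -2) / d_t ** 0.5,
                         s_q @ s_k.transpose(-1, -2) / d_s ** 0.5)
    value_loss = _kl(t_v @ t_v.transpose(-1, -2) / d_t ** 0.5,
                     s_v @ s_v.transpose(-1, -2) / d_s ** 0.5)
    return attention_loss + value_loss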
This list is automatically generated from the titles and abstracts of the papers in this site.