EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up
Knowledge Distillation
- URL: http://arxiv.org/abs/2109.07222v2
- Date: Thu, 16 Sep 2021 02:54:42 GMT
- Title: EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up
Knowledge Distillation
- Authors: Chenhe Dong, Guangrun Wang, Hang Xu, Jiefeng Peng, Xiaozhe Ren,
Xiaodan Liang
- Abstract summary: Pre-trained language models have shown remarkable results on various NLP tasks.
Due to their bulky size and slow inference speed, it is hard to deploy them on edge devices.
In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA), since the computational cost of FFN is 2-3 times larger than that of MHA.
- Score: 82.3956677850676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models have shown remarkable results on various NLP
tasks. Nevertheless, due to their bulky size and slow inference speed, it is
hard to deploy them on edge devices. In this paper, we have a critical insight
that improving the feed-forward network (FFN) in BERT has a higher gain than
improving the multi-head attention (MHA), since the computational cost of FFN is
2-3 times larger than that of MHA. Hence, to compact BERT, we focus on designing
an efficient FFN, as opposed to previous works that pay attention to MHA.
Since FFN comprises a multilayer perceptron (MLP) that is essential in BERT
optimization, we further design a thorough search space towards an advanced MLP
and perform a coarse-to-fine mechanism to search for an efficient BERT
architecture. Moreover, to accelerate searching and enhance model
transferability, we employ a novel warm-up knowledge distillation strategy at
each search stage. Extensive experiments show our searched EfficientBERT is
6.9× smaller and 4.4× faster than BERT_BASE, with competitive performance on
the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains a 77.7 average
score on the GLUE test set, 0.7 higher than MobileBERT_TINY, and achieves an
85.3/74.5 F1 score on the SQuAD v1.1/v2.0 dev sets, 3.2/2.7 higher than
TinyBERT_4 even without data augmentation.
The code is released at https://github.com/cheneydon/efficient-bert.
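As a concrete illustration of the warm-up knowledge distillation ingredient, the sketch below shows a generic distillation step that combines a soft-label KL term on the logits with an MSE term on a mapped pair of hidden states. The loss combination, temperature, weighting, and layer mapping are illustrative assumptions and are not taken from the released EfficientBERT code.

```python
# Generic knowledge-distillation warm-up step (illustrative; the exact losses, weights,
# and layer mapping used by EfficientBERT are not reproduced here).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Soft-label KL on logits plus an MSE on a mapped pair of hidden states."""
    t = temperature
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    hidden = F.mse_loss(student_hidden, teacher_hidden)
    return alpha * soft + (1.0 - alpha) * hidden

if __name__ == "__main__":
    # Dummy tensors standing in for the MLM logits and one mapped hidden-state pair.
    s_logits, t_logits = torch.randn(8, 30522), torch.randn(8, 30522)
    s_hidden, t_hidden = torch.randn(8, 128, 312), torch.randn(8, 128, 312)
    print(distillation_loss(s_logits, t_logits, s_hidden, t_hidden).item())
```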
Related papers
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and the feed-forward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by 1.47× on a single TPUv5e device.
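The two-stage recipe in this summary (cheap approximate scoring, then exact computation restricted to a high-recall candidate set) can be sketched as follows; the random-projection predictor and the oversampling factor below are stand-ins for HiRE's learned compression scheme, not its actual components.

```python
# Two-stage approximate top-k: cheap compressed scoring, then exact computation on an
# oversampled candidate set. The random projection is only a stand-in predictor.
import numpy as np

rng = np.random.default_rng(0)
d, n_rows, k, oversample, r = 512, 50_000, 64, 4, 64

W = rng.standard_normal((n_rows, d)).astype(np.float32)  # e.g. a softmax or FFN weight matrix
P = rng.standard_normal((d, r)).astype(np.float32) / np.sqrt(r)
W_compressed = W @ P                                      # precomputed once, offline

def approx_then_exact_topk(x: np.ndarray) -> np.ndarray:
    approx = W_compressed @ (P.T @ x)                     # cheap pass: O(n_rows * r)
    cand = np.argpartition(-approx, k * oversample)[: k * oversample]  # high-recall candidates
    exact = W[cand] @ x                                   # exact pass: O(k * oversample * d)
    return cand[np.argsort(-exact)[:k]]

if __name__ == "__main__":
    x = rng.standard_normal(d).astype(np.float32)
    top = approx_then_exact_topk(x)
    print(f"exact scores computed for {k * oversample} of {n_rows} rows; "
          f"returned {len(top)} indices")
```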
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - Merging Experts into One: Improving Computational Efficiency of Mixture
of Experts [71.44422347502409]
A sparse Mixture of Experts (MoE) can reduce the cost by activating a small subset of parameters.
Can we retain the advantages of adding more experts without substantially increasing the computational costs?
We propose a computation-efficient approach called Merging Experts into One (MEO), which reduces the computation cost to that of a single expert.
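For linear experts, the merge-then-compute idea is easy to verify: mixing the selected experts' weight matrices by their gate weights and then applying a single matmul gives the same result as running each expert and mixing the outputs. The sketch below checks that equivalence; the single-layer linear experts, per-token merge, and top-k gating are simplifications for illustration, not the exact MEO procedure.

```python
# Equivalence check: merging linear experts' weights by their gate weights, then a single
# matmul, matches the usual run-each-expert-and-mix MoE computation.
import torch

def moe_forward(x, experts_w, gate_scores, top_k=2):
    """Standard MoE: run each selected expert, then mix the outputs (top_k matmuls)."""
    topv, topi = gate_scores.topk(top_k, dim=-1)
    topv = torch.softmax(topv, dim=-1)
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for b in range(x.shape[0]):
            out[b] += topv[b, slot] * (x[b] @ experts_w[topi[b, slot]])
    return out

def merged_forward(x, experts_w, gate_scores, top_k=2):
    """Merge the selected experts' weights first, then do one matmul per token."""
    topv, topi = gate_scores.topk(top_k, dim=-1)
    topv = torch.softmax(topv, dim=-1)
    merged = (topv[..., None, None] * experts_w[topi]).sum(dim=1)  # (B, d, d)
    return torch.einsum("bd,bdh->bh", x, merged)

if __name__ == "__main__":
    B, d, E = 4, 16, 8
    x, experts_w, gates = torch.randn(B, d), torch.randn(E, d, d), torch.randn(B, E)
    print(torch.allclose(moe_forward(x, experts_w, gates),
                         merged_forward(x, experts_w, gates), atol=1e-5))
```

In a real MoE layer the merge would presumably be amortized across many tokens; the per-token merge here is only to make the equivalence explicit.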
arXiv Detail & Related papers (2023-10-15T13:28:42Z) - EfficientViT: Memory Efficient Vision Transformer with Cascaded Group
Attention [44.148667664413004]
We propose a family of high-speed vision transformers named EfficientViT.
We find that the speed of existing transformer models is commonly bounded by memory inefficient operations.
To address this, we present a cascaded group attention module feeding attention heads with different splits.
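The sketch below renders the mechanism described here: each attention head receives a different channel split of the token features, and each head's output is added to the input of the next head in the cascade. Projection sizes are arbitrary, and other details of the EfficientViT block (e.g. its convolutional components) are omitted, so this is a schematic rather than the paper's module.

```python
# Schematic cascaded group attention: per-head channel splits, with each head's output
# added to the next head's input. Not a faithful EfficientViT block.
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.ModuleList(
            [nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, tokens, dim)
        splits = x.chunk(self.num_heads, dim=-1)
        outs, carry = [], 0
        for head, split in enumerate(splits):
            inp = split + carry                  # cascade: feed in the previous head's output
            q, k, v = self.qkv[head](inp).chunk(3, dim=-1)
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
            carry = attn @ v
            outs.append(carry)
        return self.proj(torch.cat(outs, dim=-1))

if __name__ == "__main__":
    print(CascadedGroupAttention(64)(torch.randn(2, 49, 64)).shape)  # (2, 49, 64)
```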
arXiv Detail & Related papers (2023-05-11T17:59:41Z) - oBERTa: Improving Sparse Transfer Learning via improved initialization,
distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z) - TangoBERT: Reducing Inference Cost by using Cascaded Architecture [9.496399437260678]
We present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model.
The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model.
We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task.
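The cascade logic itself is simple; a minimal sketch (with placeholder models and an assumed confidence threshold, not TangoBERT's released configuration) looks like this:

```python
# Two-tier cascade: use the fast model's prediction when it is confident, otherwise fall
# back to the slower, more accurate model. Models and threshold are placeholders.
from typing import Callable, List, Tuple

def cascade_predict(
    texts: List[str],
    fast_model: Callable[[str], Tuple[int, float]],  # returns (label, confidence)
    slow_model: Callable[[str], int],
    threshold: float = 0.9,
) -> List[int]:
    preds = []
    for text in texts:
        label, confidence = fast_model(text)
        preds.append(label if confidence >= threshold else slow_model(text))
    return preds

if __name__ == "__main__":
    # Toy stand-ins: the "fast" model is confident on short inputs, the "slow" one always says 1.
    fast = lambda t: (0, 0.95) if len(t) < 20 else (0, 0.50)
    slow = lambda t: 1
    print(cascade_predict(["short text", "a much longer and more ambiguous input"], fast, slow))
```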
arXiv Detail & Related papers (2022-04-13T09:45:08Z) - AutoDistill: an End-to-End Framework to Explore and Distill
Hardware-Efficient Language Models [20.04008357406888]
We propose AutoDistill, an end-to-end model distillation framework for building hardware-efficient NLP pre-trained models.
Experiments on TPUv4i identify seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT.
On downstream NLP tasks from the GLUE benchmark, the 28.5M-parameter model distilled by AutoDistill for pre-training achieves an 81.69 average score.
arXiv Detail & Related papers (2022-01-21T04:32:19Z) - Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic
Sequence Length [2.8770761243361593]
TinyBERT addresses computational efficiency by self-distilling BERT into a smaller transformer representation.
Dynamic-TinyBERT is trained only once, performing on par with BERT and achieving an accuracy-speedup trade-off superior to any other efficient approach.
arXiv Detail & Related papers (2021-11-18T11:58:19Z) - DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
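As a rough, simplified rendering of the disentangled-attention idea (one of the two techniques), the attention score can be written as a sum of content-to-content, content-to-position, and position-to-content terms, with positions represented by relative-position embeddings. The sketch below shows that decomposition only; DeBERTa's relative-position bucketing, scaling, and enhanced mask decoder are omitted, and the exact indexing of the position terms is simplified.

```python
# Simplified disentangled attention scores: content-to-content + content-to-position
# + position-to-content, with relative-position embeddings. Not DeBERTa's exact layer.
import torch
import torch.nn as nn

class DisentangledAttentionScores(nn.Module):
    def __init__(self, dim: int, max_rel: int = 32):
        super().__init__()
        self.max_rel = max_rel
        self.q_c, self.k_c = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.q_r, self.k_r = nn.Linear(dim, dim), nn.Linear(dim, dim)  # act on position embeddings
        self.rel_emb = nn.Embedding(2 * max_rel + 1, dim)

    def forward(self, x):                                   # x: (batch, tokens, dim)
        _, n, _ = x.shape
        qc, kc = self.q_c(x), self.k_c(x)
        rel = torch.arange(n)[:, None] - torch.arange(n)[None, :]
        rel = rel.clamp(-self.max_rel, self.max_rel) + self.max_rel
        p = self.rel_emb(rel)                               # (n, n, dim) relative-position embeddings
        c2c = qc @ kc.transpose(-2, -1)
        c2p = torch.einsum("bqd,qkd->bqk", qc, self.k_r(p))
        p2c = torch.einsum("bkd,qkd->bqk", kc, self.q_r(p))
        return c2c + c2p + p2c

if __name__ == "__main__":
    print(DisentangledAttentionScores(64)(torch.randn(2, 10, 64)).shape)  # (2, 10, 10)
```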
arXiv Detail & Related papers (2020-06-05T19:54:34Z) - DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
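The early-exit pattern can be sketched as follows: a small classifier ("off-ramp") sits after each encoder layer, and inference stops as soon as the prediction entropy falls below a threshold. The toy encoder layers, first-token pooling, and threshold below are placeholders, not DeeBERT's configuration.

```python
# Entropy-based early exiting: stop running layers once an intermediate classifier is
# confident enough. Toy layers and threshold; not DeeBERT's released model.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim: int, num_layers: int, num_classes: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(num_layers)]
        )
        self.ramps = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(num_layers)])

    @torch.no_grad()
    def forward(self, x, entropy_threshold: float = 0.3):
        for i, (layer, ramp) in enumerate(zip(self.layers, self.ramps)):
            x = layer(x)
            probs = torch.softmax(ramp(x[:, 0]), dim=-1)   # classify on the first-token slot
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
            if entropy < entropy_threshold:                # confident enough: exit early
                return probs, i + 1
        return probs, len(self.layers)

if __name__ == "__main__":
    model = EarlyExitEncoder(dim=64, num_layers=6, num_classes=2).eval()
    probs, exit_layer = model(torch.randn(1, 16, 64))
    print(f"exited after layer {exit_layer} of 6")
```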
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.