MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
- URL: http://arxiv.org/abs/2004.02984v2
- Date: Tue, 14 Apr 2020 23:54:36 GMT
- Title: MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
- Authors: Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny
Zhou
- Abstract summary: We propose MobileBERT for compressing and accelerating the popular BERT model.
MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE.
On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE)
On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE)
- Score: 43.745884629703994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural Language Processing (NLP) has recently achieved great success by
using huge pre-trained models with hundreds of millions of parameters. However,
these models suffer from heavy model sizes and high latency such that they
cannot be deployed to resource-limited mobile devices. In this paper, we
propose MobileBERT for compressing and accelerating the popular BERT model.
Like the original BERT, MobileBERT is task-agnostic, that is, it can be
generically applied to various downstream NLP tasks via simple fine-tuning.
Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with
bottleneck structures and a carefully designed balance between self-attentions
and feed-forward networks. To train MobileBERT, we first train a specially
designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model.
Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical
studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE
while achieving competitive results on well-known benchmarks. On the natural
language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6
lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD
v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of
90.0/79.2 (1.5/2.1 higher than BERT_BASE).
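
The architecture sketched in the abstract (a narrow transformer body placed between linear bottlenecks, with capacity shifted from self-attention toward several stacked feed-forward networks) can be summarized in a few lines of PyTorch. The sketch below is illustrative only: the hyper-parameters assume the configuration described in the paper (inter-block size 512, intra-block bottleneck size 128, 4 attention heads, 4 stacked FFNs per block), while the class and argument names are assumptions rather than the authors' reference code.

```python
# Minimal sketch of a MobileBERT-style bottleneck transformer block (PyTorch).
# Hyper-parameters assume the configuration described in the paper; names are
# illustrative, not the authors' implementation.
import torch
import torch.nn as nn


class BottleneckTransformerBlock(nn.Module):
    def __init__(self, inter_size=512, intra_size=128, num_heads=4, num_ffn=4):
        super().__init__()
        # Linear bottlenecks keep the block-to-block features wide (inter_size)
        # while attention and FFNs operate at the narrow intra_size.
        self.bottleneck_in = nn.Linear(inter_size, intra_size)
        self.attention = nn.MultiheadAttention(intra_size, num_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(intra_size)
        # Several small stacked FFNs re-balance attention vs. feed-forward capacity.
        self.ffns = nn.ModuleList([
            nn.Sequential(
                nn.Linear(intra_size, intra_size * 4),
                nn.ReLU(),
                nn.Linear(intra_size * 4, intra_size),
            )
            for _ in range(num_ffn)
        ])
        self.ffn_norms = nn.ModuleList([nn.LayerNorm(intra_size) for _ in range(num_ffn)])
        self.bottleneck_out = nn.Linear(intra_size, inter_size)
        self.out_norm = nn.LayerNorm(inter_size)

    def forward(self, x):                       # x: (batch, seq_len, inter_size)
        h = self.bottleneck_in(x)               # project down to the bottleneck width
        attn_out, _ = self.attention(h, h, h)
        h = self.attn_norm(h + attn_out)
        for ffn, norm in zip(self.ffns, self.ffn_norms):
            h = norm(h + ffn(h))
        # project back up and add a residual at the inter-block width
        return self.out_norm(x + self.bottleneck_out(h))


# usage: a 24-block stack of such blocks is roughly the MobileBERT body
block = BottleneckTransformerBlock()
hidden = torch.randn(2, 16, 512)
out = block(hidden)                             # (2, 16, 512)
```

Training then proceeds, as the abstract notes, by transferring knowledge layer-by-layer from the inverted-bottleneck BERT_LARGE teacher (for example, by matching per-layer feature maps and attention distributions) before task-specific fine-tuning.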
Related papers
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models [20.04008357406888]
We propose AutoDistill, an end-to-end model distillation framework for building hardware-efficient NLP pre-trained models.
Experiments on TPUv4i show the finding of seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT.
By running downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score.
arXiv Detail & Related papers (2022-01-21T04:32:19Z)
- EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation [82.3956677850676]
Pre-trained language models have shown remarkable results on various NLP tasks.
Due to their bulky size and slow inference speed, it is hard to deploy them on edge devices.
In this paper, we present the critical insight that improving the feed-forward network (FFN) in BERT yields a higher gain than improving the multi-head attention (MHA).
arXiv Detail & Related papers (2021-09-15T11:25:39Z)
- RefBERT: Compressing BERT by Referencing to Pre-computed Representations [19.807272592342148]
RefBERT can beat the vanilla TinyBERT by more than 8.1% and achieves more than 94% of the performance of BERT_BASE on the GLUE benchmark.
RefBERT is 7.4x smaller and 9.5x faster on inference than BERT_BASE.
arXiv Detail & Related papers (2021-06-11T01:22:08Z)
- ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning (a simplified sketch of this mixed attention block appears after this list).
arXiv Detail & Related papers (2020-08-06T07:43:19Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT).
It can flexibly adjust the size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods (a minimal sketch of the width-adaptation idea appears after this list).
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
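
As a companion to the ConvBERT entry above, here is a heavily simplified PyTorch sketch of a mixed attention block: a narrowed self-attention branch handles global context, while a dynamic-convolution branch whose depthwise kernel is predicted per position handles local context. The span-based kernel generation of the actual paper is omitted, and all names and sizes below are assumptions for illustration.

```python
# Simplified sketch of ConvBERT-style mixed attention (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedDynamicConv(nn.Module):
    """Depthwise convolution whose kernel is predicted per position from the
    input itself; ConvBERT's span-based version conditions the kernel on a
    query-key span signal, which is omitted here for brevity."""

    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.kernel_size = kernel_size
        self.kernel_gen = nn.Linear(dim, kernel_size)   # one kernel per position
        self.value = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (batch, seq, dim)
        kernels = F.softmax(self.kernel_gen(x), dim=-1)                 # (b, t, k)
        v = self.value(x).transpose(1, 2)                               # (b, d, t)
        pad = self.kernel_size // 2
        windows = F.pad(v, (pad, pad)).unfold(2, self.kernel_size, 1)   # (b, d, t, k)
        # weight each local window by its position-specific kernel
        return torch.einsum("bdtk,btk->btd", windows, kernels)          # (b, t, d)


class MixedAttentionBlock(nn.Module):
    """Half the width goes through self-attention (global context), half
    through dynamic convolution (local context); outputs are concatenated."""

    def __init__(self, dim=768, num_heads=6, kernel_size=9):
        super().__init__()
        half = dim // 2
        self.attn_in = nn.Linear(dim, half)
        self.conv_in = nn.Linear(dim, half)
        self.attention = nn.MultiheadAttention(half, num_heads, batch_first=True)
        self.dynamic_conv = SimplifiedDynamicConv(half, kernel_size)
        self.out = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                               # x: (batch, seq, dim)
        a = self.attn_in(x)
        attn_out, _ = self.attention(a, a, a)
        conv_out = self.dynamic_conv(self.conv_in(x))
        return self.norm(x + self.out(torch.cat([attn_out, conv_out], dim=-1)))
```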
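
For the DynaBERT entry, here is a minimal sketch of the width-adaptation idea: one set of weights is trained so that inference can run on a sliced sub-network. The slicing granularity and names are assumptions for illustration, not DynaBERT's implementation, which adapts attention heads and FFN neurons and also drops layers for depth adaptation.

```python
# Minimal sketch of width-adaptive layers in the spirit of DynaBERT (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlicedLinear(nn.Linear):
    """A linear layer that can execute with only the first fraction of its
    output neurons, so one set of weights serves several model widths."""

    def forward(self, x, width_mult=1.0):
        out_features = max(1, int(self.out_features * width_mult))
        weight = self.weight[:out_features]             # keep a slice of output neurons
        bias = self.bias[:out_features] if self.bias is not None else None
        return F.linear(x, weight, bias)


class AdaptiveFFN(nn.Module):
    """Feed-forward block whose hidden width shrinks with width_mult; depth
    adaptivity would analogously skip whole blocks at inference time."""

    def __init__(self, hidden=768, intermediate=3072):
        super().__init__()
        self.up = SlicedLinear(hidden, intermediate)
        self.down = nn.Linear(intermediate, hidden)

    def forward(self, x, width_mult=1.0):
        h = torch.relu(self.up(x, width_mult))          # narrower intermediate features
        kept = h.shape[-1]
        # use only the matching slice of the down-projection weights
        return F.linear(h, self.down.weight[:, :kept], self.down.bias)


# usage: the same weights serve a full-width and a half-width forward pass
ffn = AdaptiveFFN()
x = torch.randn(2, 16, 768)
full = ffn(x, width_mult=1.0)
half = ffn(x, width_mult=0.5)   # roughly half the FFN compute at inference
```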