DynaBERT: Dynamic BERT with Adaptive Width and Depth
- URL: http://arxiv.org/abs/2004.04037v2
- Date: Fri, 9 Oct 2020 08:51:37 GMT
- Title: DynaBERT: Dynamic BERT with Adaptive Width and Depth
- Authors: Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu
- Abstract summary: We propose a novel dynamic BERT model (abbreviated as DynaBERT).
It can flexibly adjust its size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods.
- Score: 55.18269622415814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models like BERT, though powerful in many natural
language processing tasks, are expensive in both computation and memory. To
alleviate this problem, one approach is to compress them for specific tasks
before deployment. However, recent works on BERT compression usually compress
the large BERT model to a fixed smaller size, and thus cannot fully satisfy the
requirements of different edge devices with varying hardware capabilities. In
this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT),
which can flexibly adjust its size and latency by selecting adaptive width and
depth. The training process of DynaBERT first trains a width-adaptive BERT and
then allows both adaptive width and depth, by distilling knowledge from the
full-sized model to small sub-networks. Network rewiring is also used to keep
the more important attention heads and neurons shared by more sub-networks.
Comprehensive experiments under various efficiency constraints demonstrate that
our proposed dynamic BERT (or RoBERTa) at its largest size has performance
comparable to BERT-base (or RoBERTa-base), while at smaller widths and depths
it consistently outperforms existing BERT compression methods. Code is available at
https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.
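To make the adaptive-width idea concrete, below is a minimal, illustrative sketch (not the released implementation linked above): a toy width-adaptive feed-forward layer that keeps only the first fraction of intermediate neurons, together with a hidden-state distillation loss from the full-sized model to its narrower sub-networks. Class and variable names are hypothetical; slicing a prefix of neurons only makes sense after network rewiring has sorted the important units first.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WidthAdaptiveFFN(nn.Module):
    """Toy feed-forward layer whose width can be scaled at run time by
    keeping only the first `width_mult` fraction of intermediate neurons."""
    def __init__(self, hidden=768, intermediate=3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden, intermediate)
        self.fc2 = nn.Linear(intermediate, hidden)

    def forward(self, x, width_mult=1.0):
        n = int(self.fc1.out_features * width_mult)        # neurons to keep
        h = F.gelu(F.linear(x, self.fc1.weight[:n], self.fc1.bias[:n]))
        return F.linear(h, self.fc2.weight[:, :n], self.fc2.bias)

# Training sketch: run the full-width layer as the "teacher" and distill its
# hidden states into narrower sub-networks (depth would be handled analogously
# by dropping layers in a second training stage).
layer = WidthAdaptiveFFN()
x = torch.randn(4, 16, 768)                                # (batch, seq, hidden)
teacher_out = layer(x, width_mult=1.0).detach()
distill_loss = sum(F.mse_loss(layer(x, m), teacher_out)
                   for m in (0.75, 0.5, 0.25))
```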
Related papers
- You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT produces more compact models while achieving a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z)
- Optimizing small BERTs trained for German NER [0.16058099298620418]
We investigate various training techniques of smaller BERT models and evaluate them on five public German NER tasks.
We propose two new fine-tuning techniques leading to better performance: CSE-tagging and a modified form of LCRF.
Furthermore, we introduce a new technique called WWA which reduces BERT memory usage and leads to a small increase in performance.
arXiv Detail & Related papers (2021-04-23T12:36:13Z)
- ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state of the art in a wide range of NLP tasks.
The performance of BERT-based models is mainly driven by their enormous number of parameters, which hinders their application in resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5\times$ smaller than BERT.
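As an illustration of one of the three ingredients mentioned above, the sketch below applies truncated-SVD low-rank factorization to a single linear layer. It is a generic technique rather than ROSITA's exact recipe, and the layer sizes and rank are arbitrary.

```python
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense nn.Linear with two thinner ones via truncated SVD."""
    W = linear.weight.data                                # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                          # (out, rank)
    V_r = Vh[:rank]                                       # (rank, in)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)

# Example: a BERT-base sized feed-forward projection, 768x3072 weights
# (~2.36M) become 768x128 + 128x3072 (~0.49M) at rank 128.
compressed = low_rank_factorize(nn.Linear(768, 3072), rank=128)
```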
arXiv Detail & Related papers (2021-03-21T11:33:33Z)
- Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, considering the bidirectional and conditionally independent nature of BERT.
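The sketch below shows a generic bottleneck adapter of the kind used to fine-tune frozen pre-trained encoders and decoders; the exact adapter design and placement in this paper may differ, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a
    residual connection. Inserted after frozen BERT sub-layers so that only
    the small adapter parameters need to be fine-tuned."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Usage: keep the pre-trained BERT weights frozen and train only the adapters.
adapter = Adapter()
out = adapter(torch.randn(2, 16, 768))                    # (batch, seq, hidden)
```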
arXiv Detail & Related papers (2020-10-13T03:25:15Z)
- ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
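As a rough illustration, the sketch below implements a simplified per-token dynamic convolution head. ConvBERT's actual operator generates the kernel from a local span of tokens (hence "span-based") and mixes such heads with ordinary self-attention heads; the version here conditions the kernel on a single token and uses arbitrary dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvHead(nn.Module):
    """Each position predicts a softmax-normalized kernel over a small local
    window and mixes the value vectors in that window (a simplified stand-in
    for span-based dynamic convolution)."""
    def __init__(self, dim=64, kernel_size=5):
        super().__init__()
        self.kernel_size = kernel_size
        self.to_kernel = nn.Linear(dim, kernel_size)        # per-position kernel logits
        self.value = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (batch, seq, dim)
        b, t, d = x.shape
        k, pad = self.kernel_size, self.kernel_size // 2
        weights = F.softmax(self.to_kernel(x), dim=-1)      # (b, t, k)
        v = F.pad(self.value(x), (0, 0, pad, pad))          # pad along the sequence
        windows = torch.stack([v[:, i:i + t] for i in range(k)], dim=2)  # (b, t, k, d)
        return torch.einsum('btk,btkd->btd', weights, windows)

head = DynamicConvHead()
y = head(torch.randn(2, 16, 64))                            # local mixing only
```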
arXiv Detail & Related papers (2020-08-06T07:43:19Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
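The core mechanism is easy to sketch: attach a small "off-ramp" classifier after each transformer layer and stop at the first layer whose prediction is confident enough (low entropy). The toy model and threshold below are illustrative, not DeeBERT's trained configuration.

```python
import torch
import torch.nn as nn

def entropy(logits):
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

class EarlyExitEncoder(nn.Module):
    """Transformer layers with an off-ramp classifier after each layer; at
    inference we exit at the first layer whose prediction entropy falls
    below a threshold."""
    def __init__(self, num_layers=12, hidden=768, num_labels=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
            for _ in range(num_layers))
        self.ramps = nn.ModuleList(nn.Linear(hidden, num_labels)
                                   for _ in range(num_layers))

    @torch.no_grad()
    def forward(self, x, threshold=0.1):
        for layer, ramp in zip(self.layers, self.ramps):
            x = layer(x)
            logits = ramp(x[:, 0])             # classify from the [CLS] position
            if entropy(logits).mean() < threshold:
                break                          # confident enough: exit early
        return logits

logits = EarlyExitEncoder().eval()(torch.randn(1, 16, 768))
```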
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression [21.03685890385275]
BERT is a cutting-edge language representation model pre-trained on a large corpus.
However, BERT is memory-intensive and leads to unsatisfactory latency for user requests.
We propose a hybrid solution named LadaBERT, which combines the advantages of different model compression methods.
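As one example of the compression primitives such a hybrid scheme combines, the sketch below applies unstructured magnitude pruning to every linear layer of a module; the sparsity level and the toy network are arbitrary, and LadaBERT additionally couples this with matrix factorization and knowledge distillation.

```python
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Module, sparsity: float) -> None:
    """Zero out the smallest-magnitude weights in every nn.Linear layer."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            w = m.weight.data
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).float())           # in-place mask

# Example: prune 60% of the weights of a toy feed-forward block.
net = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
magnitude_prune_(net, sparsity=0.6)
```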
arXiv Detail & Related papers (2020-04-08T17:18:56Z)
- AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search [79.98686989604164]
Existing methods compress BERT into small models, but such compression is task-independent, i.e., the same compressed BERT is used for all downstream tasks.
We propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks.
We evaluate AdaBERT on several NLP tasks, and the results demonstrate that those task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in terms of parameter size.
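The differentiable search can be sketched with the standard continuous relaxation used in gradient-based NAS: each architecture choice is a softmax-weighted mixture over candidate operations, so the mixture weights can be trained by gradient descent and the strongest candidate is kept afterwards. The candidate operations and sizes below are illustrative, not AdaBERT's actual search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One architecture decision, relaxed into a softmax-weighted sum over
    candidate operations; the architecture parameters `alpha` are learned
    jointly with the weights."""
    def __init__(self, hidden=128):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Identity(),                                  # skip connection
            nn.Linear(hidden, hidden),                      # dense transform
            nn.Sequential(nn.Linear(hidden, hidden // 4),   # bottleneck
                          nn.ReLU(),
                          nn.Linear(hidden // 4, hidden)),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

# After search, only the candidate with the largest alpha is retained,
# yielding a small task-specific architecture.
y = MixedOp()(torch.randn(8, 128))
```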
arXiv Detail & Related papers (2020-01-13T14:03:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.