AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural
Architecture Search
- URL: http://arxiv.org/abs/2001.04246v2
- Date: Fri, 22 Jan 2021 10:58:24 GMT
- Title: AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural
Architecture Search
- Authors: Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin
Ding, Hongbo Deng, Jun Huang, Wei Lin, Jingren Zhou
- Abstract summary: Existing methods compress BERT into small models, but such compression is task-independent, i.e., the same compressed BERT is used for all downstream tasks.
We propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks.
We evaluate AdaBERT on several NLP tasks, and the results demonstrate that those task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in terms of parameter size.
- Score: 79.98686989604164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pre-trained language models such as BERT have shown their effectiveness
in various natural language processing tasks. However, the huge parameter size
makes them difficult to deploy in real-time applications that require
quick inference with limited resources. Existing methods compress BERT into
small models, but such compression is task-independent, i.e., the same
compressed BERT is used for all downstream tasks. Motivated by the necessity
and benefits of task-oriented BERT compression, we propose a novel compression
method, AdaBERT, that leverages differentiable Neural Architecture Search to
automatically compress BERT into task-adaptive small models for specific tasks.
We incorporate a task-oriented knowledge distillation loss to provide search
hints and an efficiency-aware loss as search constraints, which enables a good
trade-off between efficiency and effectiveness for task-adaptive BERT
compression. We evaluate AdaBERT on several NLP tasks, and the results
demonstrate that those task-adaptive compressed models are 12.7x to 29.3x
faster than BERT in inference time and 11.5x to 17.0x smaller in terms of
parameter size, while comparable performance is maintained.
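The abstract combines three signals during the differentiable architecture search: the downstream task loss, a task-oriented knowledge distillation loss that provides search hints from the BERT teacher, and an efficiency-aware loss that acts as a search constraint. A minimal sketch of how such a combined objective could be written follows; the function name, loss weights, distillation temperature, and per-operation cost proxy are illustrative assumptions rather than the paper's exact formulation.

import torch.nn.functional as F

# Illustrative search objective in the spirit of the abstract: task loss +
# distillation hints from the BERT teacher + an efficiency-aware penalty on
# the architecture distribution. Weights and the cost proxy are assumptions.
def search_objective(student_logits, teacher_logits, labels,
                     arch_params, op_costs,
                     temperature=4.0, gamma=0.8, beta=4.0):
    # Task-oriented loss on the downstream task labels.
    task_loss = F.cross_entropy(student_logits, labels)

    # Distillation hint: match softened student and teacher distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Efficiency-aware term: expected cost (e.g., parameters or FLOPs) of the
    # candidate operations under the softmaxed architecture parameters.
    arch_probs = F.softmax(arch_params, dim=-1)      # shape [num_edges, num_ops]
    efficiency_loss = (arch_probs * op_costs).sum()  # op_costs: shape [num_ops]

    return gamma * task_loss + (1.0 - gamma) * kd_loss + beta * efficiency_loss

In a DARTS-style differentiable search, both the candidate-operation weights and arch_params would be updated by gradient descent on such an objective, and the compressed model would be derived by keeping the highest-probability operation on each edge.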
Related papers
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve "compress once, deploy everywhere".
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models, yet achieves a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z)
- NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latency.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
arXiv Detail & Related papers (2021-05-30T07:20:27Z)
- ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state of the art in a wide range of NLP tasks.
The performance of BERT-based models is mainly driven by their enormous number of parameters, which hinders their application in resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5\times$ smaller than BERT.
arXiv Detail & Related papers (2021-03-21T11:33:33Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose DeeBERT, a simple but effective method that accelerates BERT inference via dynamic early exiting (a generic sketch of the idea appears after this list).
Experiments show that DeeBERT can save up to 40% of inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression [21.03685890385275]
BERT is a cutting-edge language representation model pre-trained on a large corpus.
However, BERT is memory-intensive and leads to unsatisfactory latency for user requests.
We propose a hybrid solution named LadaBERT, which combines the advantages of different model compression methods.
arXiv Detail & Related papers (2020-04-08T17:18:56Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust its size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods.
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
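For the DeeBERT entry above, the underlying idea of dynamic early exiting can be sketched generically: attach a small classifier to each intermediate layer and return as soon as its prediction is confident enough. The toy layers, module names, and entropy threshold below are illustrative assumptions, not DeeBERT's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEarlyExitEncoder(nn.Module):
    # Stand-in linear "layers" with per-layer off-ramp classifiers; a real
    # model would use transformer layers, but the exit logic is the same.
    def __init__(self, num_layers=12, hidden=768, num_classes=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])
        self.exits = nn.ModuleList([nn.Linear(hidden, num_classes) for _ in range(num_layers)])

    @torch.no_grad()
    def forward(self, x, entropy_threshold=0.3):
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits), start=1):
            x = torch.relu(layer(x))
            probs = F.softmax(exit_head(x), dim=-1)
            # Exit at the first layer whose prediction entropy is low enough.
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
            if entropy < entropy_threshold:
                return probs, depth
        return probs, depth  # no early exit: all layers were used

# Usage: a confident input may exit after only a few of the 12 toy layers.
model = ToyEarlyExitEncoder()
probs, layers_used = model(torch.randn(1, 768))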