ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques
- URL: http://arxiv.org/abs/2103.11367v1
- Date: Sun, 21 Mar 2021 11:33:33 GMT
- Title: ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques
- Authors: Yuanxin Liu and Zheng Lin and Fengcheng Yuan
- Abstract summary: Pre-trained language models of the BERT family have defined the state of the art in a wide range of NLP tasks.
The performance of BERT-based models is mainly driven by their enormous number of parameters, which hinders their application to resource-limited scenarios.
We integrate three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5\times$ smaller than BERT while maintaining $98.5\%$ of its performance on five tasks of the GLUE benchmark.
- Score: 10.983311133796745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models of the BERT family have defined the
state of the art in a wide range of NLP tasks. However, the performance of
BERT-based models is mainly driven by their enormous number of parameters, which
hinders their application to resource-limited scenarios. Faced with this
problem, recent studies have been attempting to compress BERT into a
small-scale model. However, most previous work primarily focuses on a single
kind of compression technique, and little attention has been paid to the
combination of different methods. When BERT is compressed with integrated
techniques, a critical question is how to design the entire compression
framework to obtain the optimal performance. In response to this question, we
integrate three kinds of compression methods (weight pruning, low-rank
factorization and knowledge distillation (KD)) and explore a range of designs
concerning model architecture, KD strategy, pruning frequency and learning rate
schedule. We find that a careful choice of the designs is crucial to the
performance of the compressed model. Based on the empirical findings, our best
compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques
(ROSITA), is $7.5 \times$ smaller than BERT while maintaining $98.5\%$ of the
performance on five tasks of the GLUE benchmark, outperforming previous
BERT compression methods with a similar parameter budget. The code is available
at https://github.com/llyx97/Rosita.
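As a rough illustration of how the three techniques named in the abstract could be combined, the following PyTorch sketch factorizes one linear layer with a truncated SVD, applies magnitude pruning, and defines a soft-label distillation loss. The rank, sparsity, and temperature values are illustrative assumptions, not the configuration used by ROSITA.

```python
# Minimal sketch (not the paper's implementation) of the three techniques the
# abstract combines: low-rank factorization, magnitude pruning, and knowledge
# distillation. Rank, sparsity, and temperature values are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense linear layer with two low-rank factors via truncated SVD."""
    W = layer.weight.detach()                        # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    with torch.no_grad():
        first.weight.copy_(Vh[:rank, :])             # (rank, in_features)
        second.weight.copy_(U[:, :rank] * S[:rank])  # (out_features, rank)
        if layer.bias is not None:
            second.bias.copy_(layer.bias)
    return nn.Sequential(first, second)

def magnitude_prune_(weight: torch.Tensor, sparsity: float) -> None:
    """Zero out the smallest-magnitude entries of `weight` in place."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return
    with torch.no_grad():
        threshold = weight.abs().flatten().kthvalue(k).values
        weight.mul_((weight.abs() > threshold).float())

def kd_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-label knowledge distillation: KL divergence on temperature-scaled logits."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

A pipeline in this spirit would factorize and prune selected layers of a fine-tuned BERT and then train the compressed student with `kd_loss` added to the task loss; which layers to touch, how often to prune, and how to schedule distillation are exactly the design choices the abstract says the paper studies.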
Related papers
- Exploring Extreme Parameter Compression for Pre-trained Language Models [45.80044281531393]
This work explores larger compression ratios for pre-trained language models (PLMs).
Two decomposition and reconstruction protocols are proposed to improve the effectiveness and efficiency during compression.
A tiny version achieves $96.7\%$ of BERT-base's performance with $1/48$ of the encoder parameters and $2.7\times$ faster inference.
arXiv Detail & Related papers (2022-05-20T09:16:55Z)
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models.
We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information.
We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
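The second-order criterion behind this line of work can be sketched as follows, assuming a diagonal empirical Fisher matrix as the Hessian approximation; this is a simplification for illustration, and the paper's own approximation may differ.

```python
# Sketch of an Optimal-Brain-Surgeon-style pruning criterion using a diagonal
# empirical Fisher as the Hessian approximation. This is an illustrative
# simplification, not necessarily the approximation used by the paper.
import torch

def diagonal_fisher(per_sample_grads) -> torch.Tensor:
    """Estimate a diagonal Fisher by averaging squared per-sample gradients."""
    fisher = torch.zeros_like(per_sample_grads[0])
    for g in per_sample_grads:
        fisher += g.pow(2)
    return fisher / len(per_sample_grads)

def second_order_saliency(weight, fisher_diag, eps: float = 1e-8):
    """OBS saliency rho_i = w_i^2 / (2 [H^-1]_ii); with H ~ diag(F) this is w_i^2 * F_ii / 2."""
    return 0.5 * weight.pow(2) * (fisher_diag + eps)

def prune_lowest_saliency(weight, fisher_diag, sparsity: float):
    """Return a pruned copy of `weight` with the least salient entries zeroed."""
    scores = second_order_saliency(weight, fisher_diag)
    k = max(int(weight.numel() * sparsity), 1)
    threshold = scores.flatten().kthvalue(k).values
    mask = (scores > threshold).float()
    return weight * mask, mask
```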
arXiv Detail & Related papers (2022-03-14T16:40:31Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can conduct quantization and pruning simultaneously at a subgroup-wise level.
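As a rough sketch of what subgroup-wise quantization can mean, the snippet below applies uniform symmetric fake-quantization to row groups of a weight matrix, each with its own bit-width. The grouping and bit assignments here are made up for illustration, whereas the paper searches for them automatically.

```python
# Sketch of subgroup-wise mixed-precision quantization: each row group of a
# weight matrix gets its own bit-width and scale. The grouping and bit-widths
# are illustrative assumptions; the paper searches for them automatically.
import torch

def quantize_group(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fake-quantization of one group to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def subgroup_quantize(weight: torch.Tensor, group_bits) -> torch.Tensor:
    """Quantize equal-sized row groups of `weight` with per-group bit-widths."""
    groups = torch.chunk(weight, len(group_bits), dim=0)
    return torch.cat(
        [quantize_group(g, b) for g, b in zip(groups, group_bits)], dim=0
    )

# Example: a 12-row matrix split into 3 row groups at 8, 4, and 2 bits.
w = torch.randn(12, 16)
w_q = subgroup_quantize(w, group_bits=[8, 4, 2])
```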
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve "compress once and deploy everywhere".
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models while achieving a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z)
- BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
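A minimal sketch of plain weight binarization with a per-row scaling factor is shown below; BinaryBERT's full training recipe is more involved than this forward-pass approximation.

```python
# Sketch of plain weight binarization with a per-row scaling factor:
# W ~ alpha * sign(W) with alpha = mean(|W|) per output row. BinaryBERT's full
# training recipe is more involved than this forward-pass approximation.
import torch

def binarize_weights(weight: torch.Tensor) -> torch.Tensor:
    """Approximate a weight matrix by alpha * sign(W), one scale per row."""
    alpha = weight.abs().mean(dim=1, keepdim=True)   # per-row scaling factor
    return alpha * torch.sign(weight)

# In training, a straight-through estimator usually passes gradients through
# sign(); only the forward approximation is shown here.
w = torch.randn(4, 8)
print(binarize_weights(w))
```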
arXiv Detail & Related papers (2020-12-31T16:34:54Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
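A threshold-based ternarization in the style of Ternary Weight Networks is sketched below as one way to map weights to {-alpha, 0, +alpha}; whether TernaryBERT uses exactly this quantizer is an assumption, and its distillation-aware training loop is not shown.

```python
# TWN-style threshold ternarization sketch mapping weights to {-alpha, 0, +alpha};
# whether TernaryBERT uses exactly this quantizer is an assumption, and its
# distillation-aware training loop is not shown.
import torch

def ternarize(weight: torch.Tensor) -> torch.Tensor:
    """Ternarize with a magnitude threshold and a single scaling factor."""
    delta = 0.7 * weight.abs().mean()              # TWN threshold heuristic
    mask = (weight.abs() > delta).float()
    alpha = (weight.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return alpha * torch.sign(weight) * mask

w = torch.randn(4, 8)
print(ternarize(w))
```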
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression [21.03685890385275]
BERT is a cutting-edge language representation model pre-trained on a large corpus.
However, BERT is memory-intensive and leads to unsatisfactory latency for user requests.
We propose a hybrid solution named LadaBERT, which combines the advantages of different model compression methods.
arXiv Detail & Related papers (2020-04-08T17:18:56Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT).
It can flexibly adjust the size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods.
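A rough sketch of the adaptive-width idea: run a feed-forward sublayer using only a leading fraction of its hidden neurons, selected by a width multiplier. The layer sizes and the assumption that important neurons have already been reordered to the front are illustrative, not DynaBERT's exact mechanism.

```python
# Sketch of adaptive-width slicing in the spirit of DynaBERT: a feed-forward
# sublayer is run with only the first fraction of its hidden neurons. Sizes
# and the neuron ordering are illustrative assumptions.
import torch
import torch.nn as nn

class SlimmableFFN(nn.Module):
    def __init__(self, hidden: int = 768, intermediate: int = 3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden, intermediate)
        self.fc2 = nn.Linear(intermediate, hidden)

    def forward(self, x: torch.Tensor, width_mult: float = 1.0) -> torch.Tensor:
        m = int(self.fc1.out_features * width_mult)   # active intermediate size
        h = torch.relu(x @ self.fc1.weight[:m].t() + self.fc1.bias[:m])
        return h @ self.fc2.weight[:, :m].t() + self.fc2.bias

ffn = SlimmableFFN()
x = torch.randn(2, 16, 768)
print(ffn(x, width_mult=0.5).shape)   # same output shape, half the FFN width
```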
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
- AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search [79.98686989604164]
Existing methods compress BERT into small models, but such compression is task-independent, i.e., the same compressed BERT is used for all downstream tasks.
We propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks.
We evaluate AdaBERT on several NLP tasks, and the results demonstrate that those task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in terms of parameter size.
arXiv Detail & Related papers (2020-01-13T14:03:26Z)