You Only Compress Once: Towards Effective and Elastic BERT Compression
via Exploit-Explore Stochastic Nature Gradient
- URL: http://arxiv.org/abs/2106.02435v1
- Date: Fri, 4 Jun 2021 12:17:44 GMT
- Title: You Only Compress Once: Towards Effective and Elastic BERT Compression
via Exploit-Explore Stochastic Nature Gradient
- Authors: Shaokun Zhang, Xiawu Zheng, Chenyi Yang, Yuchao Li, Yan Wang, Fei
Chao, Mengdi Wang, Shen Li, Jun Yang, Rongrong Ji
- Abstract summary: Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models while achieving a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
- Score: 88.58536093633167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite superior performance on various natural language processing tasks,
pre-trained models such as BERT are difficult to deploy on resource-constrained
devices. Most existing model compression approaches require
re-compression or fine-tuning across diverse constraints to accommodate various
hardware deployments. This practically limits the further application of model
compression. Moreover, the ineffective training and searching process of
existing elastic compression paradigms [4, 27] prevents the direct migration to
BERT compression. Motivated by the necessity of efficient inference across
various constraints on BERT, we propose a novel approach, YOCO-BERT, to achieve
compress once and deploy everywhere. Specifically, we first construct a huge
search space with 10^13 architectures, which covers nearly all configurations
in the BERT model. Then, we propose a novel stochastic nature gradient optimization
method to guide the generation of optimal candidate architectures while keeping a
balanced trade-off between exploration and exploitation. When a certain
resource constraint is given, a lightweight distribution optimization approach
is utilized to obtain the optimal network for target deployment without
fine-tuning. Compared with state-of-the-art algorithms, YOCO-BERT provides more
compact models while achieving a 2.1%-4.5% average accuracy improvement on the
GLUE benchmark. Besides, YOCO-BERT is also more efficient, e.g., the training
complexity is O(1) for N different devices. Code is available at
https://github.com/MAC-AutoML/YOCO-BERT.
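The pipeline sketched in the abstract has two phases: learn a probability distribution over sub-network configurations with stochastic nature (natural) gradient updates that balance exploration against exploitation, then, for any given resource budget, select a sub-network from that distribution without fine-tuning. The snippet below is a minimal, self-contained sketch of that idea, not the authors' implementation: the per-layer width choices, the toy reward standing in for supernet validation accuracy, the ranking-based utility with epsilon-uniform exploration, and the parameter-count model are all illustrative assumptions.

```python
# Hypothetical sketch of "compress once, deploy everywhere" via a stochastic natural
# gradient over a categorical architecture distribution. Rewards, width choices and
# the parameter model are toy stand-ins, not YOCO-BERT's actual search space.
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 12                                    # BERT-base depth
WIDTH_CHOICES = np.array([192, 384, 576, 768])     # hypothetical per-layer hidden sizes

# theta[l] is a categorical distribution over WIDTH_CHOICES for layer l
theta = np.full((NUM_LAYERS, len(WIDTH_CHOICES)), 1.0 / len(WIDTH_CHOICES))

def sample_arch(theta, eps=0.1):
    """Sample one architecture; the eps-uniform mix keeps rare options alive (exploration)."""
    probs = (1 - eps) * theta + eps / theta.shape[1]
    return np.array([rng.choice(theta.shape[1], p=p) for p in probs])

def reward(arch):
    """Stand-in for the validation accuracy of the sampled sub-network of a supernet."""
    return -np.abs(WIDTH_CHOICES[arch] - 576).mean() / 768.0   # toy: prefers mid-size layers

def sng_step(theta, lam=8, lr=0.1, eps=0.1):
    """One natural-gradient step for categorical distributions:
    theta += lr * mean(utility * (one_hot - theta))."""
    archs = [sample_arch(theta, eps) for _ in range(lam)]
    rewards = np.array([reward(a) for a in archs])
    # ranking-based utility: push probability toward the better half of samples (exploitation)
    utility = np.where(rewards >= np.median(rewards), 1.0, -1.0)
    grad = np.zeros_like(theta)
    for a, u in zip(archs, utility):
        grad += u * (np.eye(theta.shape[1])[a] - theta)
    theta = np.clip(theta + lr * grad / lam, 1e-3, None)
    return theta / theta.sum(axis=1, keepdims=True)

def params_of(arch):
    """Rough transformer-layer parameter count (about 12*d^2 per layer), for constraint checks."""
    return int((12 * WIDTH_CHOICES[arch] ** 2).sum())

def deploy(theta, budget, tries=1000):
    """Given a resource budget, sample candidates from the learned distribution and
    keep the best feasible one -- no re-compression or fine-tuning per device."""
    candidates = (sample_arch(theta, eps=0.0) for _ in range(tries))
    feasible = [a for a in candidates if params_of(a) <= budget]
    return max(feasible, key=reward) if feasible else None

for _ in range(200):
    theta = sng_step(theta)
best = deploy(theta, budget=60_000_000)
print("selected widths:", WIDTH_CHOICES[best] if best is not None else "none under budget")
```

In a real setting the reward would come from evaluating each sampled sub-network with weights shared from a trained one-shot supernet; because the categorical natural-gradient update has the closed form above, the search stays cheap, which is what allows a single run to serve N deployment targets at O(1) training cost.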
Related papers
- Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for
Natural Language Understanding [20.75335227098455]
Large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks.
New hardware supporting both NxM semi-structured sparsity and low-precision integer computation is a promising solution to boost model serving efficiency.
We propose a flexible compression framework NxMiFormer that performs simultaneous sparsification and quantization (a toy illustration of NxM sparsity with low-bit quantization follows this entry).
arXiv Detail & Related papers (2022-06-30T04:33:50Z)
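The NxMiFormer entry above combines NxM semi-structured sparsity with low-bit integer quantization. The snippet below is a toy illustration of those two ingredients in isolation (not the NxMiFormer implementation): an N:M mask that keeps the N largest-magnitude weights in every group of M consecutive weights, followed by symmetric per-tensor int8 quantization. The group size, bit width, and example matrix are illustrative assumptions.

```python
# Toy N:M semi-structured sparsity plus int8 quantization (illustrative, not NxMiFormer).
import numpy as np

def nxm_sparsify(w: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Zero all but the n largest-magnitude entries in each group of m consecutive weights."""
    groups = w.reshape(-1, m)                             # one row per group of m weights
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]     # indices of the n largest magnitudes
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (groups * mask).reshape(w.shape)

def int8_quantize(w: np.ndarray):
    """Symmetric per-tensor int8 quantization; returns integer weights and the scale."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)           # last dim divisible by m
w_sparse = nxm_sparsify(w, n=2, m=4)                      # 2:4 pattern -> 50% zeros
q, scale = int8_quantize(w_sparse)
print("sparsity:", (w_sparse == 0).mean(),
      "max dequantization error:", np.abs(q * scale - w_sparse).max())
```

The 2:4 pattern produced here is the layout that sparse tensor-core hardware accelerates, which is where the serving-efficiency gains mentioned in the summary come from.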
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
- NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latency.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
arXiv Detail & Related papers (2021-05-30T07:20:27Z)
- ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have set the state of the art in a wide range of NLP tasks.
The performance of BERT-based models is mainly driven by their enormous number of parameters, which hinders their application in resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is 7.5x smaller than BERT.
arXiv Detail & Related papers (2021-03-21T11:33:33Z)
- Neural Network Compression Via Sparse Optimization [23.184290795230897]
We propose a model compression framework based on the recent progress on sparse optimization.
We achieve up to 7.2x and 2.9x FLOPs reduction with the same level of accuracy on VGG16 for CIFAR-10 and ResNet-50 for ImageNet, respectively.
arXiv Detail & Related papers (2020-11-10T03:03:55Z)
- GAN Slimming: All-in-One GAN Compression by A Unified Optimization Framework [94.26938614206689]
We propose GAN Slimming (GS), the first unified optimization framework combining multiple compression techniques for GAN compression.
We apply GS to compress CartoonGAN, a state-of-the-art style transfer network, by up to 47 times, with minimal visual quality degradation.
arXiv Detail & Related papers (2020-08-25T14:39:42Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT).
It can flexibly adjust the size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods.
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
- AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search [79.98686989604164]
Existing methods compress BERT into small models, but such compression is task-independent, i.e., the same compressed BERT is used for all downstream tasks.
We propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks.
We evaluate AdaBERT on several NLP tasks, and the results demonstrate that these task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in parameter size (a toy sketch of the differentiable relaxation such search builds on follows this entry).
arXiv Detail & Related papers (2020-01-13T14:03:26Z)
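The AdaBERT entry above relies on differentiable neural architecture search. As a point of reference (an illustrative sketch, not AdaBERT's search space or code), the snippet below shows the standard DARTS-style relaxation such methods build on: each layer computes a softmax-weighted mixture of candidate operations, so the architecture parameters receive gradients from the ordinary task loss and the strongest operation can be kept after search. The candidate operations and dimensions are made up for illustration.

```python
# Minimal DARTS-style mixed operation (illustrative; not AdaBERT's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """A layer whose output is a softmax-weighted sum of candidate operations."""
    def __init__(self, dim: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                        # skip connection
            nn.Linear(dim, dim),                  # simple linear op
            nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(),
                          nn.Linear(dim // 2, dim)),  # bottleneck feed-forward op
        ])
        # one architecture parameter per candidate op, learned jointly with the weights
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Toy usage: the architecture parameters get gradients from the task loss, so after
# search one keeps only the operation with the largest alpha in each layer.
layer = MixedOp(dim=64)
x = torch.randn(8, 64)
loss = layer(x).pow(2).mean()
loss.backward()
print("architecture-parameter gradients:", layer.alpha.grad)
```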