NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural
Architecture Search
- URL: http://arxiv.org/abs/2105.14444v1
- Date: Sun, 30 May 2021 07:20:27 GMT
- Title: NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural
Architecture Search
- Authors: Jin Xu, Xu Tan, Renqian Luo, Kaitao Song, Jian Li, Tao Qin, Tie-Yan
Liu
- Abstract summary: We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latency.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
- Score: 100.71365025972258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While pre-trained language models (e.g., BERT) have achieved impressive
results on various natural language processing tasks, they have large numbers
of parameters and incur high computational and memory costs, making them
difficult to deploy in real-world applications. Therefore, model compression is
necessary to reduce the computation and memory costs of pre-trained models. In
this work, we aim to compress BERT and address the following two challenging
practical issues: (1) The compression algorithm should be able to output
multiple compressed models with different sizes and latencies, in order to
support devices with different memory and latency limitations; (2) The
algorithm should be downstream task agnostic, so that the compressed models are
generally applicable for different downstream tasks. We leverage techniques in
neural architecture search (NAS) and propose NAS-BERT, an efficient method for
BERT compression. NAS-BERT trains a big supernet on a search space containing a
variety of architectures and outputs multiple compressed models with adaptive
sizes and latency. Furthermore, the training of NAS-BERT is conducted on
standard self-supervised pre-training tasks (e.g., masked language model) and
does not depend on specific downstream tasks. Thus, the compressed models can
be used across various downstream tasks. The technical challenge of NAS-BERT is
that training a big supernet on the pre-training task is extremely costly. We
employ several techniques including block-wise search, search space pruning,
and performance approximation to improve search efficiency and accuracy.
Extensive experiments on GLUE and SQuAD benchmark datasets demonstrate that
NAS-BERT can find lightweight models with better accuracy than previous
approaches, and can be directly applied to different downstream tasks with
adaptive model sizes for different requirements of memory or latency.
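To make the supernet idea concrete, below is a minimal, hypothetical sketch (not the authors' code) of weight-sharing supernet training on a masked-language-model objective: each layer holds several candidate operations, a sub-architecture is sampled at every step, and only the sampled path is executed and updated. The candidate set, dimensions, and token ids are illustrative placeholders, and NAS-BERT's block-wise search, search-space pruning, and performance approximation are omitted.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, LAYERS = 30522, 128, 4
CANDIDATES = ["mha", "ffn", "identity"]  # illustrative stand-ins for real candidate ops

class CandidateLayer(nn.Module):
    """Holds one module per candidate op; only the sampled op is executed."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleDict({
            "mha": nn.MultiheadAttention(dim, num_heads=4, batch_first=True),
            "ffn": nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim)),
            "identity": nn.Identity(),
        })

    def forward(self, x, choice):
        if choice == "mha":
            out, _ = self.ops["mha"](x, x, x)  # self-attention
            return out
        return self.ops[choice](x)

class SuperNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.layers = nn.ModuleList([CandidateLayer(DIM) for _ in range(LAYERS)])
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, arch):
        x = self.embed(tokens)
        for layer, choice in zip(self.layers, arch):
            x = layer(x, choice)
        return self.head(x)

model = SuperNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One fake pre-training step: mask ~15% of tokens and predict them.
tokens = torch.randint(0, VOCAB, (8, 32))
labels = tokens.clone()
mask = torch.rand(tokens.shape) < 0.15
tokens[mask] = 103  # placeholder [MASK] token id

arch = [random.choice(CANDIDATES) for _ in range(LAYERS)]  # sample one sub-model
logits = model(tokens, arch)
loss = F.cross_entropy(logits[mask], labels[mask])
loss.backward()
optimizer.step()
```

After such training, compressed sub-models of different sizes can be extracted by fixing `arch` to particular candidate sequences and selecting among them according to memory or latency budgets.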
Related papers
- AutoDistil: Few-shot Task-agnostic Neural Architecture Search for
Distilling Large Language Models [121.22644352431199]
We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model.
Current works train a single SuperLM consisting of millions of subnetworks with weight-sharing.
Experiments on the GLUE benchmark against state-of-the-art KD and NAS methods demonstrate that AutoDistil outperforms leading compression techniques.
arXiv Detail & Related papers (2022-01-29T06:13:04Z) - UDC: Unified DNAS for Compressible TinyML Models [10.67922101024593]
This work bridges the gap between NPU HW capability and NN model design by proposing a neural architecture search (NAS) algorithm.
We demonstrate Unified DNAS for Compressible models (UDC) on CIFAR100, ImageNet, and DIV2K super-resolution tasks.
On ImageNet, we find dominant compressible models, which are 1.9x smaller or 5.76% more accurate.
arXiv Detail & Related papers (2022-01-15T12:35:26Z) - Differentiable Network Pruning for Microcontrollers [14.864940447206871]
We present a differentiable structured network pruning method for convolutional neural networks.
It integrates a model's MCU-specific resource usage and parameter importance feedback to obtain highly compressed yet accurate classification models.
arXiv Detail & Related papers (2021-10-15T20:26:15Z) - You Only Compress Once: Towards Effective and Elastic BERT Compression
via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models, yet achieving 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z) - Binarized Neural Architecture Search for Efficient Object Recognition [120.23378346337311]
Binarized neural architecture search (BNAS) produces extremely compressed models to reduce the huge computational cost on embedded devices for edge computing.
An accuracy of 96.53% vs. 97.22% is achieved on the CIFAR-10 dataset, but with a significantly compressed model, and a 40% faster search than the state-of-the-art PC-DARTS.
arXiv Detail & Related papers (2020-09-08T15:51:23Z) - Search What You Want: Barrier Penalty NAS for Mixed Precision
Quantization [51.26579110596767]
We propose a novel Barrier Penalty based NAS (BP-NAS) for mixed precision quantization.
BP-NAS sets a new state of the art on both classification (CIFAR-10, ImageNet) and detection (COCO).
arXiv Detail & Related papers (2020-07-20T12:00:48Z) - Self-Supervised GAN Compression [32.21713098893454]
We show that a standard model compression technique, weight pruning, cannot be applied to GANs using existing methods.
We then develop a self-supervised compression technique which uses the trained discriminator to supervise the training of a compressed generator.
We show that this framework maintains compelling performance at high degrees of sparsity, can easily be applied to new tasks and models, and enables meaningful comparisons between different pruning granularities.
arXiv Detail & Related papers (2020-07-03T04:18:54Z) - AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural
Architecture Search [79.98686989604164]
Existing methods compress BERT into small models, but such compression is task-independent, i.e., the same compressed BERT is used for all downstream tasks.
We propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks (a generic sketch of this differentiable selection follows this entry).
We evaluate AdaBERT on several NLP tasks, and the results demonstrate that those task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in terms of parameter size.
arXiv Detail & Related papers (2020-01-13T14:03:26Z)
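The differentiable NAS that AdaBERT builds on relaxes the discrete per-layer choice of operations into a softmax over learnable architecture parameters, so the choice can be optimized by gradient descent together with the model weights. Below is a minimal, generic sketch of that relaxation (DARTS-style); the candidate operations and dimensions are illustrative and do not reflect AdaBERT's actual search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation: output is a softmax-weighted sum of candidate ops."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Linear(dim, dim),                                  # a "full-width" op
            nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                          nn.Linear(dim // 2, dim)),              # a slimmer bottleneck op
            nn.Identity(),                                        # skip connection
        ])
        # Architecture parameters: one logit per candidate op.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

layer = MixedOp(dim=64)
x = torch.randn(4, 64)
out = layer(x)            # differentiable w.r.t. model weights *and* alpha
out.sum().backward()
print(layer.alpha.grad)   # gradients flow into the architecture choice
```

After the search, the operation with the largest architecture weight in each layer is kept and the rest are discarded, yielding a compact task-specific model.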