Prune Once for All: Sparse Pre-Trained Language Models
- URL: http://arxiv.org/abs/2111.05754v1
- Date: Wed, 10 Nov 2021 15:52:40 GMT
- Title: Prune Once for All: Sparse Pre-Trained Language Models
- Authors: Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat
- Abstract summary: We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation.
These sparse pre-trained models can be used for transfer learning to a wide range of tasks while maintaining their sparsity pattern.
We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.
- Score: 0.6063525456640462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based language models are applied to a wide range of applications
in natural language processing. However, they are inefficient and difficult to
deploy. In recent years, many compression algorithms have been proposed to
increase the implementation efficiency of large Transformer-based models on
target hardware. In this work we present a new method for training sparse
pre-trained Transformer language models by integrating weight pruning and model
distillation. These sparse pre-trained models can be used for transfer
learning to a wide range of tasks while maintaining their sparsity pattern. We
demonstrate our method with three known architectures to create sparse
pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed
sparse pre-trained models we trained transfer their knowledge to five different
downstream natural language tasks with minimal accuracy loss. Moreover, we show
how to further compress the sparse models' weights to 8-bit precision using
quantization-aware training. For example, with our sparse pre-trained
BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a
compression ratio of $40$X for the encoder with less than $1\%$ accuracy loss.
To the best of our knowledge, our results show the best compression-to-accuracy
ratio for BERT-Base, BERT-Large, and DistilBERT.
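As a concrete illustration of the two ingredients above, the PyTorch-style sketch below shows (a) how a fixed sparsity pattern can be preserved during downstream fine-tuning by re-applying a mask after every optimizer step, and (b) the back-of-the-envelope arithmetic behind a ~40X encoder compression ratio, assuming roughly 90% unstructured sparsity combined with 32-bit to 8-bit quantization. This is a minimal sketch under those assumptions, not the authors' released implementation; `freeze_sparsity_pattern` and `reapply_masks` are illustrative helpers.

```python
# Minimal sketch (not the authors' code): keep a pre-trained sparsity pattern
# fixed during fine-tuning, and estimate the combined compression ratio.
import torch
import torch.nn as nn


def freeze_sparsity_pattern(model: nn.Module):
    """Record a 0/1 mask from the already-pruned weights of every Linear layer."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            masks[name] = (module.weight != 0).float()
    return masks


def reapply_masks(model: nn.Module, masks):
    """Zero out any weight the optimizer moved off the sparsity pattern."""
    with torch.no_grad():
        for name, module in model.named_modules():
            if name in masks:
                module.weight.mul_(masks[name])


# Hypothetical downstream fine-tuning loop: the mask is applied after every
# optimizer step, so transfer learning never densifies the encoder.
#   for batch in loader:
#       loss = model(**batch).loss
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
#       reapply_masks(model, masks)

# Rough compression arithmetic, assuming ~90% unstructured sparsity (an
# assumption consistent with, but not stated in, the summary above):
sparsity, bits = 0.90, 8
ratio = (1.0 / (1.0 - sparsity)) * (32 / bits)  # 10x fewer weights * 4x fewer bits
print(f"approximate encoder compression: {ratio:.0f}X")  # -> 40X
```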
Related papers
- Blockwise Compression of Transformer-based Models without Retraining [6.118476907408718]
We propose BCT, a blockwise compression framework for transformers that requires no retraining.
Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise.
BCT effectively compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results.
arXiv Detail & Related papers (2023-04-04T02:55:40Z)
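As a rough illustration of what "operating blockwise" can mean in practice, the sketch below quantizes a weight matrix tile by tile so each block gets its own scale, with no retraining. The block size, symmetric int8 scheme, and `quantize_blockwise` helper are assumptions made for this example, not details taken from the BCT paper.

```python
# Illustrative per-block quantization: each tile of the weight matrix gets its
# own scale, instead of one scale for the whole layer. No retraining involved.
import numpy as np


def quantize_blockwise(w: np.ndarray, block: int = 64, bits: int = 8):
    """Symmetric per-block fake quantization of a 2-D weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(w)
    for i in range(0, w.shape[0], block):
        for j in range(0, w.shape[1], block):
            tile = w[i:i + block, j:j + block]
            scale = np.abs(tile).max() / qmax + 1e-12
            out[i:i + block, j:j + block] = np.round(tile / scale) * scale
    return out


w = np.random.randn(768, 768).astype(np.float32)
w_q = quantize_blockwise(w)
print("mean abs quantization error:", np.abs(w - w_q).mean())
```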
- CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than standard SGD/Adam-trained baselines while remaining stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z)
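A schematic sketch of a compression-aware update in the spirit described above: the gradient is evaluated at a temporarily pruned copy of the weights and then applied to the dense weights, so the dense model stays accurate under post-training pruning. This is a paraphrase under assumptions (one-shot magnitude pruning as the compression operator, a `loss_fn` closure for the current batch), not CrAM's exact algorithm.

```python
# Schematic "compression-aware" step: gradient at pruned weights, update on
# dense weights. loss_fn is a hypothetical closure over the current batch.
import torch


def compression_aware_step(params, loss_fn, optimizer, sparsity=0.5):
    # 1) remember dense weights and magnitude-prune a copy in place
    backups = []
    for p in params:
        backups.append(p.detach().clone())
        k = int(p.numel() * sparsity)
        if k > 0:
            thresh = p.detach().abs().flatten().kthvalue(k).values
            p.data[p.detach().abs() <= thresh] = 0.0
    # 2) gradient of the loss evaluated at the compressed weights
    optimizer.zero_grad()
    loss_fn().backward()
    # 3) restore dense weights and apply the update to them
    for p, b in zip(params, backups):
        p.data.copy_(b)
    optimizer.step()
```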
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
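The toy sketch below shows one way a dense FFN could be adapted into a few smaller experts with top-1 routing, splitting hidden neurons by an importance score. The FFN layout (`Linear-GELU-Linear`), the per-neuron `importance` tensor, and the routing scheme are assumptions made for illustration; MoEBERT's importance-guided adaptation and distillation details are in the paper.

```python
# Toy conversion of a dense FFN into experts (illustration only).
import torch
import torch.nn as nn


class ToyMoEFFN(nn.Module):
    def __init__(self, ffn: nn.Sequential, importance: torch.Tensor, num_experts: int = 4):
        super().__init__()
        w1, w2 = ffn[0], ffn[2]                        # assumed Linear(d, h), Linear(h, d)
        order = importance.argsort(descending=True)     # most important neurons first
        splits = order.chunk(num_experts)
        self.experts = nn.ModuleList()
        for idx in splits:
            e1 = nn.Linear(w1.in_features, len(idx))
            e2 = nn.Linear(len(idx), w2.out_features)
            e1.weight.data = w1.weight.data[idx].clone()
            e1.bias.data = w1.bias.data[idx].clone()
            e2.weight.data = w2.weight.data[:, idx].clone()
            e2.bias.data = w2.bias.data.clone()
            self.experts.append(nn.Sequential(e1, nn.GELU(), e2))
        self.router = nn.Linear(w1.in_features, num_experts)

    def forward(self, x):                               # x: (tokens, d)
        choice = self.router(x).argmax(dim=-1)          # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = choice == i
            if sel.any():
                out[sel] = expert(x[sel])
        return out
```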
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models.
We introduce the Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information.
We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
arXiv Detail & Related papers (2022-03-14T16:40:31Z)
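To make "approximate second-order information" concrete, the sketch below scores weights with the classic Optimal Brain Surgeon rule, w_i^2 / (2 [H^{-1}]_ii), using a diagonal empirical Fisher built from per-sample gradients as a stand-in for the Hessian. The diagonal approximation and the `grads` input are simplifying assumptions; the paper uses more accurate block-wise approximations.

```python
# Simplified second-order saliency scoring with a diagonal empirical Fisher.
import torch


def second_order_prune(weight: torch.Tensor, grads: list, sparsity: float, eps: float = 1e-8):
    """Zero the weights with the lowest OBS-style saliency."""
    # diagonal empirical Fisher: mean of squared per-sample gradients
    fisher_diag = torch.stack([g.pow(2) for g in grads]).mean(dim=0)
    inv_h_diag = 1.0 / (fisher_diag + eps)
    score = weight.pow(2) / (2.0 * inv_h_diag)
    k = int(weight.numel() * sparsity)
    idx = score.flatten().argsort()[:k]      # least salient weights
    pruned = weight.clone()
    pruned.view(-1)[idx] = 0.0
    return pruned
```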
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
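A rough Net2Net-style width expansion conveys the general idea of reusing a small model's weights to initialize a larger one: duplicated units have their outgoing weights rescaled so the composed function is preserved. This is a generic sketch under that assumption, not bert2BERT's FPI/AKI initialization rules.

```python
# Function-preserving width expansion of two stacked Linear layers (new_width > old).
import torch
import torch.nn as nn


def widen(layer1: nn.Linear, layer2: nn.Linear, new_width: int):
    old = layer1.out_features
    mapping = torch.cat([torch.arange(old),
                         torch.randint(0, old, (new_width - old,))])  # duplicate random units
    l1 = nn.Linear(layer1.in_features, new_width)
    l2 = nn.Linear(new_width, layer2.out_features)
    l1.weight.data = layer1.weight.data[mapping].clone()
    l1.bias.data = layer1.bias.data[mapping].clone()
    # divide the fan-out of each duplicated unit so layer2's output is unchanged
    counts = torch.bincount(mapping, minlength=new_width)[mapping].float()
    l2.weight.data = layer2.weight.data[:, mapping] / counts
    l2.bias.data = layer2.bias.data.clone()
    return l1, l2
```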
- NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a large supernet over a search space and outputs multiple compressed models with adaptive sizes and latencies.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
arXiv Detail & Related papers (2021-05-30T07:20:27Z)
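As a toy illustration of the "one supernet, many sizes" idea, the snippet below enumerates candidate sub-architectures and keeps those under a parameter budget. The search dimensions, the `approx_params` estimate, and the 15M-parameter budget are invented for the example; NAS-BERT's actual search space and selection procedure are far richer.

```python
# Toy sub-architecture enumeration under a parameter budget.
from itertools import product

VOCAB = 30522

def approx_params(layers: int, hidden: int, ffn_mult: int = 4) -> int:
    # rough per-layer cost: attention (4*h*h) + FFN (2*ffn_mult*h*h), plus embeddings
    return layers * (4 * hidden * hidden + 2 * ffn_mult * hidden * hidden) + VOCAB * hidden

budget = 15_000_000  # hypothetical 15M-parameter target
candidates = [(l, h) for l, h in product([2, 4, 6, 12], [128, 256, 384, 512])
              if approx_params(l, h) <= budget]
print(candidates)
```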
- Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning [12.761055946548437]
We propose efficient transformer-based large-scale language representations using hardware-friendly block-structured pruning.
Besides significantly reducing weight storage and computation, the proposed approach achieves high compression rates.
The final compressed model is suitable for deployment on resource-constrained edge devices.
arXiv Detail & Related papers (2020-09-17T04:45:47Z)
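A small sketch of block-structured magnitude pruning: whole tiles of a weight matrix are kept or zeroed based on their norm, which maps naturally onto block-sparse kernels on edge hardware. The 32x32 block size and L2 scoring are common choices assumed here, not necessarily the paper's.

```python
# Block-structured magnitude pruning: score each tile, zero the weakest ones.
import numpy as np


def block_prune(w: np.ndarray, block=(32, 32), sparsity=0.8):
    rows, cols = block
    h, v = w.shape[0] // rows, w.shape[1] // cols
    blocks = w[:h * rows, :v * cols].reshape(h, rows, v, cols)
    norms = np.linalg.norm(blocks, axis=(1, 3))      # one score per block
    k = int(norms.size * sparsity)
    thresh = np.sort(norms, axis=None)[k]            # prune the weakest blocks
    mask = (norms > thresh)[:, None, :, None]
    return (blocks * mask).reshape(h * rows, v * cols)


w = np.random.randn(768, 3072).astype(np.float32)
w_pruned = block_prune(w)
print("kept fraction:", (w_pruned != 0).mean())
```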
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
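A condensed sketch of the quantization-noise idea as summarized above: at each training step only a random subset of weights passes through the (fake) quantizer, while the rest stay full precision, so most of the gradient bypasses the non-differentiable rounding. The simple int8 scheme and element-wise noise mask are simplifications; the paper also covers block-wise noise and product quantization.

```python
# Quantization noise, simplified: quantize only a random subset of weights each step.
import torch


def quant_noise(w: torch.Tensor, p: float = 0.1, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax + 1e-12
    w_q = torch.round(w / scale) * scale                # fake-quantized weights
    noise_mask = (torch.rand_like(w) < p).float()       # which weights get quantized
    # straight-through on the quantized subset, identity on the rest
    return w + noise_mask * (w_q - w).detach()
```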
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.