Efficient Transformer-based Large Scale Language Representations using
Hardware-friendly Block Structured Pruning
- URL: http://arxiv.org/abs/2009.08065v4
- Date: Mon, 16 Nov 2020 22:13:31 GMT
- Title: Efficient Transformer-based Large Scale Language Representations using
Hardware-friendly Block Structured Pruning
- Authors: Bingbing Li, Zhenglun Kong, Tianyun Zhang, Ji Li, Zhengang Li, Hang
Liu, Caiwen Ding
- Abstract summary: We propose an efficient transformer-based large-scale language representation using hardware-friendly block structure pruning.
In addition to significantly reducing weight storage and computation, the proposed approach achieves high compression rates.
The final compressed model is suitable for deployment on resource-constrained edge devices.
- Score: 12.761055946548437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained large-scale language models have increasingly demonstrated high
accuracy on many natural language processing (NLP) tasks. However, the limited
weight storage and computational speed on hardware platforms have impeded the
popularity of pre-trained models, especially in the era of edge computing. In
this work, we propose an efficient transformer-based large-scale language
representation using hardware-friendly block structure pruning. We incorporate
the reweighted group Lasso into block-structured pruning for optimization.
In addition to significantly reducing weight storage and computation, the proposed
approach achieves high compression rates. Experimental results on different
models (BERT, RoBERTa, and DistilBERT) on the General Language Understanding
Evaluation (GLUE) benchmark tasks show that we achieve up to a 5.0x compression rate
with zero or minor accuracy degradation on certain tasks. Our proposed method is also
orthogonal to existing compact pre-trained language models such as DistilBERT
using knowledge distillation, since a further 1.79x average compression rate
can be achieved on top of DistilBERT with zero or minor accuracy degradation.
The final compressed model is suitable for deployment on resource-constrained
edge devices.
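As a rough illustration of the core idea, the sketch below adds a reweighted group-Lasso penalty over fixed-size weight blocks of a single linear layer and, after training, zeroes out blocks whose norms fall below a threshold. The block size (64), penalty strength, reweighting rule, and threshold are illustrative assumptions, not the paper's actual settings.

```python
# A minimal sketch of block-structured pruning with a reweighted group Lasso
# penalty applied to one linear layer. Hyperparameters are illustrative.
import torch

def block_norms(weight: torch.Tensor, block: int) -> torch.Tensor:
    """Frobenius norm of each non-overlapping (block x block) tile."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    tiles = weight.reshape(rows // block, block, cols // block, block)
    return tiles.pow(2).sum(dim=(1, 3)).sqrt()          # (rows//block, cols//block)

def reweighted_group_lasso(weight: torch.Tensor, block: int, reweight: torch.Tensor) -> torch.Tensor:
    """Penalty = sum of per-block norms, each scaled by its reweighting factor."""
    return (reweight * block_norms(weight, block)).sum()

# one training step (hypothetical layer, data, and hyperparameters)
layer = torch.nn.Linear(768, 768)
reweight = torch.ones(12, 12)                            # 768 / 64 = 12 blocks per side
x, target = torch.randn(8, 768), torch.randn(8, 768)
loss = torch.nn.functional.mse_loss(layer(x), target) \
       + 1e-4 * reweighted_group_lasso(layer.weight, 64, reweight)
loss.backward()

# periodically: raise the penalty on blocks that are already small ...
reweight = 1.0 / (block_norms(layer.weight, 64).detach() + 1e-3)
# ... and after training, zero out whole blocks whose norm is below a threshold
with torch.no_grad():
    mask = (block_norms(layer.weight, 64) > 1e-2).float()
    layer.weight.mul_(mask.repeat_interleave(64, 0).repeat_interleave(64, 1))
```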
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
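For readers unfamiliar with the term, the snippet below is a generic sketch of exponential-moving-average coefficient tracking; the paper applies this idea to the coefficients of its higher-order predictor, whose exact formulation is not reproduced here.

```python
# A generic sketch of EMA coefficient learning: a combination coefficient is
# smoothed across steps instead of taken directly from the latest estimate.
import torch

class EMACoefficients:
    def __init__(self, num_coeffs: int, decay: float = 0.99):
        self.decay = decay
        self.coeffs = torch.full((num_coeffs,), 1.0 / num_coeffs)  # start uniform

    def update(self, step_estimate: torch.Tensor) -> torch.Tensor:
        # blend the previous smoothed value with the newest per-step estimate
        self.coeffs = self.decay * self.coeffs + (1.0 - self.decay) * step_estimate.detach()
        return self.coeffs

# usage: combine several candidate predictions with the smoothed coefficients
ema = EMACoefficients(num_coeffs=3)
predictions = torch.randn(3, 8, 512)                     # e.g., three predictor orders
coeffs = ema.update(torch.softmax(torch.randn(3), dim=0))
combined = (coeffs.view(3, 1, 1) * predictions).sum(dim=0)
print(combined.shape)                                    # torch.Size([8, 512])
```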
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
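The sketch below shows plain column-row sampling (CRS), the family of unbiased matrix-product estimators that WTA-CRS builds on; the sampling probabilities, scaling, and sample count are the standard CRS choices, not the paper's winner-take-all variant.

```python
# A minimal sketch of column-row sampling (CRS) for approximating A @ B.
import torch

def crs_matmul(A: torch.Tensor, B: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Unbiased estimate of A @ B from `num_samples` sampled column-row pairs."""
    # importance of each column-of-A / row-of-B pair
    scores = A.norm(dim=0) * B.norm(dim=1)
    probs = scores / scores.sum()
    idx = torch.multinomial(probs, num_samples, replacement=True)
    # rescale so the estimator stays unbiased
    scale = 1.0 / (num_samples * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]

# usage: compare against the exact product
A, B = torch.randn(64, 512), torch.randn(512, 32)
approx, exact = crs_matmul(A, B, num_samples=128), A @ B
print((approx - exact).norm() / exact.norm())            # relative error of the estimate
```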
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model [0.0]
Excessive overhead leads to large latency and computational costs.
We propose a model acceleration approach for large language models.
Our model achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT.
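Below is a minimal sketch of the general token-downsampling idea, keeping the top-k tokens by a simple learned score; the paper instead derives its keep decisions from an information-bottleneck objective, and the scorer and keep ratio here are illustrative.

```python
# A minimal sketch of dynamic token downsampling between transformer layers.
import torch

def downsample_tokens(hidden: torch.Tensor, scorer: torch.nn.Linear, keep_ratio: float):
    """hidden: (batch, seq_len, dim) -> (batch, kept_len, dim)"""
    scores = scorer(hidden).squeeze(-1)                      # (batch, seq_len)
    k = max(1, int(hidden.size(1) * keep_ratio))
    topk = scores.topk(k, dim=1).indices.sort(dim=1).values  # keep original token order
    index = topk.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    return hidden.gather(1, index)

# usage with hypothetical shapes
hidden = torch.randn(4, 128, 768)
scorer = torch.nn.Linear(768, 1)
pruned = downsample_tokens(hidden, scorer, keep_ratio=0.5)
print(pruned.shape)                                          # torch.Size([4, 64, 768])
```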
arXiv Detail & Related papers (2023-05-21T13:30:56Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models.
We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information.
We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
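For intuition, the sketch below ranks weights by an OBS-style second-order saliency using a diagonal empirical-Fisher proxy for the Hessian; the actual method uses a more accurate blockwise inverse-Hessian estimate, so this is only an approximation of the idea.

```python
# A minimal sketch of second-order (OBS-style) pruning with a diagonal
# empirical-Fisher approximation of the Hessian.
import torch

def fisher_diagonal(grads):
    """Average of squared per-sample gradients as a diagonal Hessian proxy."""
    return torch.stack([g.pow(2) for g in grads]).mean(dim=0)

def prune_by_saliency(weight: torch.Tensor, fisher: torch.Tensor, sparsity: float):
    # OBS-style saliency: removing w_i costs roughly w_i^2 * H_ii / 2
    saliency = 0.5 * weight.pow(2) * fisher
    k = int(weight.numel() * sparsity)
    threshold = saliency.flatten().kthvalue(k).values
    mask = (saliency > threshold).float()
    return weight * mask, mask

# usage with hypothetical gradients collected over a few calibration batches
weight = torch.randn(768, 768)
grads = [torch.randn_like(weight) for _ in range(8)]
pruned_w, mask = prune_by_saliency(weight, fisher_diagonal(grads), sparsity=0.9)
print(mask.mean())                                       # ~0.1 of the weights kept
```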
arXiv Detail & Related papers (2022-03-14T16:40:31Z)
- Prune Once for All: Sparse Pre-Trained Language Models [0.6063525456640462]
We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation.
These sparse pre-trained models can be used for transfer learning to a wide range of tasks while maintaining their sparsity pattern.
We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.
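A minimal sketch of combining pruning with distillation in this spirit: the student keeps a fixed sparsity mask while matching the teacher's soft predictions. The toy models, temperature, and loss weighting below are illustrative stand-ins, not the paper's training recipe.

```python
# A minimal sketch of distillation with a fixed sparsity mask on the student.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, masks, batch, temperature=2.0, alpha=0.5):
    inputs, labels = batch
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    task = F.cross_entropy(student_logits, labels)
    loss = alpha * kd + (1 - alpha) * task
    loss.backward()
    # re-apply the fixed masks so pruned weights stay zero
    # (in a full loop this goes right after optimizer.step())
    with torch.no_grad():
        for name, param in student.named_parameters():
            if name in masks:
                param.mul_(masks[name])
    return loss

# usage with toy stand-in models
student, teacher = torch.nn.Linear(16, 4), torch.nn.Linear(16, 4)
masks = {"weight": (torch.rand(4, 16) > 0.8).float()}
batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
print(distillation_step(student, teacher, masks, batch))
```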
arXiv Detail & Related papers (2021-11-10T15:52:40Z)
- KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation [5.8287955127529365]
We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition.
We present KroneckerBERT, a compressed version of the BERT_BASE model obtained with this framework.
Our experiments indicate that the proposed model has promising out-of-distribution robustness and is superior to the state-of-the-art compression methods on SQuAD.
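The parameter-saving idea can be sketched as a linear layer whose weight is the Kronecker product of two small factors. The shapes below are illustrative; the paper additionally distills knowledge from the uncompressed model, and practical implementations avoid materializing the full product as done here.

```python
# A minimal sketch of a Kronecker-factored linear layer.
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    def __init__(self, out1, out2, in1, in2):
        super().__init__()
        # the full weight would be (out1*out2) x (in1*in2); we store two small factors
        self.A = nn.Parameter(torch.randn(out1, in1) * 0.02)
        self.B = nn.Parameter(torch.randn(out2, in2) * 0.02)

    def forward(self, x):
        weight = torch.kron(self.A, self.B)              # (out1*out2, in1*in2)
        return x @ weight.t()

# usage: a 768x768 layer stored as 24x24 and 32x32 factors
layer = KroneckerLinear(24, 32, 24, 32)
print(sum(p.numel() for p in layer.parameters()))        # 1600 parameters vs. 589824
x = torch.randn(4, 768)
print(layer(x).shape)                                    # torch.Size([4, 768])
```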
arXiv Detail & Related papers (2021-09-13T18:19:30Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Compressing Deep Neural Network (DNN) models is essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods can hardly deliver real inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
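One plausible reading of micro-structured unification and pruning is sketched below: each weight row is split into small groups, low-norm groups are pruned, and surviving groups share a single magnitude (keeping signs). The group size, threshold, and unification rule are assumptions, not the paper's exact formulation.

```python
# A heavily simplified sketch of micro-structured weight unification and pruning.
import torch

def unify_and_prune(weight: torch.Tensor, group: int = 4, prune_quantile: float = 0.5):
    rows, cols = weight.shape
    groups = weight.reshape(rows, cols // group, group)       # (rows, cols//group, group)
    norms = groups.norm(dim=-1)
    threshold = norms.flatten().quantile(prune_quantile)
    keep = (norms > threshold).float().unsqueeze(-1)          # prune small groups
    shared_mag = groups.abs().mean(dim=-1, keepdim=True)      # one magnitude per group
    unified = groups.sign() * shared_mag                      # unify surviving groups
    return (unified * keep).reshape(rows, cols)

compressed = unify_and_prune(torch.randn(768, 768))
print((compressed == 0).float().mean())                       # roughly the prune quantile
```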
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We search for the best BERT model structure under a given computation budget to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
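A minimal sketch of Quant-Noise-style training is shown below: on each forward pass a random subset of weights passes through simulated int8 quantization, with a straight-through estimator so gradients still reach the full-precision weights. The noise rate and per-tensor scale are illustrative choices.

```python
# A minimal sketch of training with quantization noise and a straight-through estimator.
import torch

def quant_noise(weight: torch.Tensor, p: float = 0.1, bits: int = 8) -> torch.Tensor:
    scale = weight.abs().max() / (2 ** (bits - 1) - 1)
    quantized = torch.round(weight / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    noise_mask = (torch.rand_like(weight) < p).float()        # quantize only a random subset
    mixed = weight * (1 - noise_mask) + quantized * noise_mask
    # straight-through estimator: forward uses the partially quantized weight,
    # backward sees an identity, so the full-precision weights keep receiving gradients
    return weight + (mixed - weight).detach()

# usage inside a layer's forward pass (hypothetical)
w = torch.randn(768, 768, requires_grad=True)
x = torch.randn(4, 768)
out = x @ quant_noise(w).t()
out.sum().backward()
print(w.grad.shape)                                           # gradients reach the weights
```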
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.