Combining Compressions for Multiplicative Size Scaling on Natural
Language Tasks
- URL: http://arxiv.org/abs/2208.09684v1
- Date: Sat, 20 Aug 2022 14:01:56 GMT
- Title: Combining Compressions for Multiplicative Size Scaling on Natural
Language Tasks
- Authors: Rajiv Movva, Jinhao Lei, Shayne Longpre, Ajay Gupta, Chris DuBois
- Abstract summary: Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP.
We compare accuracy vs. model size tradeoffs across six BERT architecture sizes and eight GLUE tasks.
We find that quantization and distillation consistently provide greater benefit than pruning.
- Score: 7.813460653362095
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Quantization, knowledge distillation, and magnitude pruning are among the
most popular methods for neural network compression in NLP. Independently,
these methods reduce model size and can accelerate inference, but their
relative benefit and combinatorial interactions have not been rigorously
studied. For each of the eight possible subsets of these techniques, we compare
accuracy vs. model size tradeoffs across six BERT architecture sizes and eight
GLUE tasks. We find that quantization and distillation consistently provide
greater benefit than pruning. Surprisingly, except for the pair of pruning and
quantization, using multiple methods together rarely yields diminishing
returns. Instead, we observe complementary and super-multiplicative reductions
to model size. Our work quantitatively demonstrates that combining compression
methods can synergistically reduce model size, and that practitioners should
prioritize (1) quantization, (2) knowledge distillation, and (3) pruning to
maximize accuracy vs. model size tradeoffs.
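As a rough illustration of stacking the three techniques in that order of priority, here is a minimal sketch (not the authors' pipeline; it assumes PyTorch and Hugging Face Transformers are installed, and uses DistilBERT as a stand-in for a distilled BERT student):

```python
# Minimal sketch (not the paper's code): stack distillation, magnitude
# pruning, and quantization on one model. Assumes torch and transformers
# are installed; DistilBERT stands in for a distilled BERT student.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

# (2) Knowledge distillation: start from an already-distilled student.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# (3) Magnitude pruning: zero the smallest 40% of weights in each Linear
# layer (unstructured L1 pruning; 40% is an arbitrary example amount).
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.4)
        prune.remove(module, "weight")  # bake the zeros into the tensor

# (1) Quantization: dynamic int8 quantization of all Linear layers,
# storing weights in int8 instead of fp32 (~4x smaller per layer).
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Note that unstructured sparsity only reduces on-disk size once the zeroed weights are stored in a sparse or compressed format; the int8 quantization and the smaller distilled architecture shrink the model directly, which is how the reductions compound multiplicatively.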
Related papers
- Effective Interplay between Sparsity and Quantization: From Theory to Practice [33.697590845745815]
Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy.
We investigate the interaction between these two methods and assess whether their combination impacts final model accuracy.
Our findings inform the efficient deployment of large models on resource-limited compute platforms and help reduce serving costs.
arXiv Detail & Related papers (2024-05-31T15:34:13Z)
- CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks [1.5199992713356987]
This paper introduces CompactifAI, an innovative compression approach using quantum-inspired tensor networks.
Our method is versatile and can be implemented with - or on top of - other compression techniques.
As a benchmark, we demonstrate that combining CompactifAI with quantization reduces the memory size of LlaMA 7B by 93%.
arXiv Detail & Related papers (2024-01-25T11:45:21Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z)
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose CoFi (Coarse- and Fine-grained Pruning), a task-specific structured pruning method.
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
- Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find that inter-correlations and intra-correlations exist when observing latent variables from a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speedup.
arXiv Detail & Related papers (2022-03-21T11:44:17Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- Block Pruning For Faster Transformers [89.70392810063247]
We introduce a block pruning approach targeting both small and fast models.
We find that this approach learns to prune out full components of the underlying model, such as attention heads.
arXiv Detail & Related papers (2021-09-10T12:46:32Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
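The last entry above relies on Quantization Aware Training with a Straight-Through Estimator (STE). As a minimal, hedged illustration of that idea (not the paper's Quant-Noise implementation), a fake-quantizer can round weights to int8 in the forward pass while letting gradients pass through unchanged:

```python
# Illustrative sketch of a straight-through estimator (STE) fake-quantizer,
# as used in standard Quantization Aware Training; not the paper's code.
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, num_bits):
        # Symmetric per-tensor quantization to signed num_bits integers.
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: treat round() as the identity for gradients.
        return grad_output, None

w = torch.randn(4, 4, requires_grad=True)
FakeQuantSTE.apply(w, 8).sum().backward()
print(w.grad)  # all ones: the rounding step did not block gradients
```

The rounding operation has zero gradient almost everywhere, so the STE simply passes the incoming gradient through, which lets the network train against the quantized weights it will use at inference time.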
This list is automatically generated from the titles and abstracts of the papers on this site.