Revisiting Offline Compression: Going Beyond Factorization-based Methods
for Transformer Language Models
- URL: http://arxiv.org/abs/2302.04045v1
- Date: Wed, 8 Feb 2023 13:36:06 GMT
- Title: Revisiting Offline Compression: Going Beyond Factorization-based Methods
for Transformer Language Models
- Authors: Mohammadreza Banaei, Klaudia Ba{\l}azy, Artur Kasymov, R\'emi Lebret,
Jacek Tabor, Karl Aberer
- Abstract summary: transformer language models achieve outstanding results in many natural language processing (NLP) tasks.
Their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks.
In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model.
- Score: 7.542276054279341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent transformer language models achieve outstanding results in many
natural language processing (NLP) tasks. However, their enormous size often
makes them impractical on memory-constrained devices, requiring practitioners
to compress them to smaller networks. In this paper, we explore offline
compression methods, meaning computationally-cheap approaches that do not
require further fine-tuning of the compressed model. We challenge the classical
matrix factorization methods by proposing a novel, better-performing
autoencoder-based framework. We perform a comprehensive ablation study of our
approach, examining its different aspects over a diverse set of evaluation
settings. Moreover, we show that enabling collaboration between modules across
layers by compressing certain modules together positively impacts the final
model performance. Experiments on various NLP tasks demonstrate that our
approach significantly outperforms commonly used factorization-based offline
compression methods.
Related papers
- Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models [0.0]
An increase in the number of connected devices around the world warrants compressed models that can be easily deployed at the local devices with low compute capacity and power accessibility.
We implemented both, quantization and pruning, compression techniques on popular deep learning models used in the image classification, object detection, language models and generative models-based problem statements.
arXiv Detail & Related papers (2024-07-22T14:20:53Z) - Composable Interventions for Language Models [60.32695044723103]
Test-time interventions for language models can enhance factual accuracy, mitigate harmful outputs, and improve model efficiency without costly retraining.
But despite a flood of new methods, different types of interventions are largely developing independently.
We introduce composable interventions, a framework to study the effects of using multiple interventions on the same language models.
arXiv Detail & Related papers (2024-07-09T01:17:44Z) - Fast Vocabulary Transfer for Language Model Compression [3.5668409338590195]
We propose a new method for model compression that relies on vocabulary transfer.
Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques.
arXiv Detail & Related papers (2024-02-15T14:37:07Z) - A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV)
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - LLMLingua: Compressing Prompts for Accelerated Inference of Large
Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity.
We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z) - Exploring Dimensionality Reduction Techniques in Multilingual
Transformers [64.78260098263489]
This paper gives a comprehensive account of the impact of dimensional reduction techniques on the performance of state-of-the-art multilingual Siamese Transformers.
It shows that it is possible to achieve an average reduction in the number of dimensions of $91.58% pm 2.59%$ and $54.65% pm 32.20%$, respectively.
arXiv Detail & Related papers (2022-04-18T17:20:55Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the textithomogeneous word embeddings
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z) - What do Compressed Large Language Models Forget? Robustness Challenges
in Model Compression [68.82486784654817]
We study two popular model compression techniques including knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z) - Direction is what you need: Improving Word Embedding Compression in
Large Language Models [7.736463504706344]
This paper presents a novel loss objective to compress token embeddings in Transformer-based models by leveraging an AutoEncoder architecture.
Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model Perplexity.
arXiv Detail & Related papers (2021-06-15T14:28:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.