Related papers: Revisiting Offline Compression: Going Beyond Factorization-based Methods for Transformer Language Models

URL: http://arxiv.org/abs/2302.04045v1
Date: Wed, 8 Feb 2023 13:36:06 GMT
Title: Revisiting Offline Compression: Going Beyond Factorization-based Methods for Transformer Language Models
Authors: Mohammadreza Banaei, Klaudia Bałazy, Artur Kasymov, Rémi Lebret, Jacek Tabor, Karl Aberer
Abstract summary: Transformer language models achieve outstanding results in many natural language processing (NLP) tasks. Their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks. In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model.
Score: 7.542276054279341
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent transformer language models achieve outstanding results in many natural language processing (NLP) tasks. However, their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks. In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model. We challenge the classical matrix factorization methods by proposing a novel, better-performing autoencoder-based framework. We perform a comprehensive ablation study of our approach, examining its different aspects over a diverse set of evaluation settings. Moreover, we show that enabling collaboration between modules across layers by compressing certain modules together positively impacts the final model performance. Experiments on various NLP tasks demonstrate that our approach significantly outperforms commonly used factorization-based offline compression methods.
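
For readers unfamiliar with the factorization baseline the abstract refers to, the following is a minimal sketch of offline low-rank compression of a single weight matrix via truncated SVD. It is not the authors' code and does not implement their autoencoder-based framework. The function name `low_rank_factorize`, the matrix shape, and the rank are illustrative assumptions.

```python
import numpy as np


def low_rank_factorize(weight, rank):
    """Return factors (a, b) such that a @ b approximates weight at the given rank.

    Hypothetical helper for illustration only: a plain truncated SVD,
    applied offline with no fine-tuning afterwards.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out_dim, rank), singular values folded in
    b = vt[:rank, :]            # shape (rank, in_dim)
    return a, b


# Assumed shape, chosen to resemble a feed-forward projection; random data, not real weights.
rng = np.random.default_rng(0)
weight = rng.standard_normal((768, 3072))

a, b = low_rank_factorize(weight, rank=64)

original_params = weight.size
compressed_params = a.size + b.size
rel_error = np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)

print(f"parameters: {original_params} -> {compressed_params}")
print(f"relative reconstruction error: {rel_error:.3f}")
```

Storing `a` and `b` instead of `weight` reduces the parameter count whenever `rank` is smaller than `out_dim * in_dim / (out_dim + in_dim)`. A random matrix such as the one above has little low-rank structure, so its reconstruction error is high; trained weights may behave differently.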

Related papers

Choose Your Model Size: Any Compression of Large Language Models Without Re-Computation [10.376875638696504]
This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off. We use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty. We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
arXiv Detail & Related papers (2024-12-17T01:09:23Z)
SEE: Sememe Entanglement Encoding for Transformer-bases Models Compression [20.824040486029354]
Transformer-based large language models exhibit groundbreaking capabilities, but their storage and computational costs are high, limiting their application in resource-constrained scenarios. An effective approach is to eliminate redundant model parameters and computational costs while incorporating efficient expert-derived knowledge structures to achieve a balance between compression and performance.
arXiv Detail & Related papers (2024-12-15T12:01:43Z)
Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models [0.0]
An increase in the number of connected devices around the world warrants compressed models that can be easily deployed on local devices with low compute capacity and limited power. We implemented two compression techniques, quantization and pruning, on popular deep learning models used for image classification, object detection, language modeling, and generative-model problem statements.
arXiv Detail & Related papers (2024-07-22T14:20:53Z)
Composable Interventions for Language Models [60.32695044723103]
Test-time interventions for language models can enhance factual accuracy, mitigate harmful outputs, and improve model efficiency without costly retraining. But despite a flood of new methods, different types of interventions are largely developing independently. We introduce composable interventions, a framework to study the effects of using multiple interventions on the same language models.
arXiv Detail & Related papers (2024-07-09T01:17:44Z)
Fast Vocabulary Transfer for Language Model Compression [3.5668409338590195]
We propose a new method for model compression that relies on vocabulary transfer. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques.
arXiv Detail & Related papers (2024-02-15T14:37:07Z)
A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV). Model compression methods reduce the memory and computational cost of Transformer. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks. However, the massive size of these models poses huge challenges for their deployment in real-world applications. We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity. We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)
Exploring Dimensionality Reduction Techniques in Multilingual Transformers [64.78260098263489]
This paper gives a comprehensive account of the impact of dimensional reduction techniques on the performance of state-of-the-art multilingual Siamese Transformers. It shows that it is possible to achieve an average reduction in the number of dimensions of $91.58\% \pm 2.59\%$ and $54.65\% \pm 32.20\%$, respectively.
arXiv Detail & Related papers (2022-04-18T17:20:55Z)
Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings. We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques: knowledge distillation and pruning. We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets. We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
Direction is what you need: Improving Word Embedding Compression in Large Language Models [7.736463504706344]
This paper presents a novel loss objective to compress token embeddings in Transformer-based models by leveraging an AutoEncoder architecture. Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model Perplexity.
arXiv Detail & Related papers (2021-06-15T14:28:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.