A Comprehensive Survey of Compression Algorithms for Language Models
- URL: http://arxiv.org/abs/2401.15347v1
- Date: Sat, 27 Jan 2024 08:38:56 GMT
- Title: A Comprehensive Survey of Compression Algorithms for Language Models
- Authors: Seungcheol Park, Jaehyeon Choi, Sojin Lee, and U Kang
- Abstract summary: We survey and summarize diverse compression algorithms including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design.
We discuss the value of each category of compression algorithms and the desired properties of low-cost compression algorithms, which have become especially important with the emergence of large language models.
- Score: 10.21587168771851
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is growing rapidly, aiming to preserve the remarkable advances of recent language models while avoiding the side effects of their gigantic size, such as increased carbon emissions and expensive maintenance fees. While numerous compression algorithms have shown remarkable progress in compressing language models, the sheer number of algorithms ironically makes it challenging to capture emerging trends and identify the fundamental concepts underlying them. In this paper, we
survey and summarize diverse compression algorithms including pruning,
quantization, knowledge distillation, low-rank approximation, parameter
sharing, and efficient architecture design. We not only summarize the overall
trend of diverse compression algorithms but also select representative
algorithms and provide in-depth analyses of them. We discuss the value of each category of compression algorithms and the desired properties of low-cost compression algorithms, which have become especially important with the emergence of large language models. Finally, we introduce promising future research topics
based on our survey results.
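To make one of the surveyed categories concrete, here is a minimal sketch of unstructured magnitude pruning in NumPy; the layer shape and sparsity level are illustrative choices, not values taken from the survey.
```python
import numpy as np

def magnitude_prune(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries of a weight matrix.

    sparsity: fraction of entries to remove (e.g. 0.9 keeps 10% of weights).
    """
    k = int(weight.size * sparsity)
    if k == 0:
        return weight.copy()
    # Threshold = magnitude of the k-th smallest |w|.
    threshold = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    mask = np.abs(weight) > threshold
    return weight * mask

# Illustrative usage on a random "layer".
rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768))
w_pruned = magnitude_prune(w, sparsity=0.9)
print("kept fraction:", np.count_nonzero(w_pruned) / w.size)
```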
Related papers
- Ranking LLMs by compression [13.801767671391604]
We use five large language models as priors for compression and compare their performance on challenging natural language processing tasks.
Experimental results show that compression ratio and model performance are positively correlated, so compression ratio can serve as a general metric for evaluating large language models.
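A minimal sketch of the underlying idea (not the paper's exact protocol): a language model's negative log-likelihood approximates the code length an arithmetic coder would achieve, so the ratio of raw bits to model bits can serve as a compression-based score. The token_log2_probs interface below is an assumed placeholder for a real model.
```python
import math
from typing import Callable, List

def compression_ratio(text: str,
                      token_log2_probs: Callable[[str], List[float]]) -> float:
    """Compression ratio of `text` under a language model.

    token_log2_probs: placeholder for a model-specific function returning
    log2 p(token | prefix) for each token of `text` (an assumed interface).
    """
    raw_bits = 8 * len(text.encode("utf-8"))      # uncompressed size in bits
    model_bits = -sum(token_log2_probs(text))     # ideal arithmetic-coding size
    return raw_bits / model_bits

# Toy "model" assigning a fixed probability of 1/16 per character (4 bits each).
toy = lambda s: [math.log2(1 / 16)] * len(s)
print(compression_ratio("hello world", toy))      # 8 bits / 4 bits per char = 2.0
```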
arXiv Detail & Related papers (2024-06-20T10:23:38Z)
- Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance [34.641079276516926]
We argue for the theoretical importance of compression, which can be viewed as 0-gram language modeling.
We demonstrate the empirical importance of compression for downstream success of pre-trained language models.
We show that there is a correlation between tokenizers' compression and models' downstream performance.
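One simple way to measure a tokenizer's compression is average characters per token, sketched below; the whitespace tokenizer is only a stand-in for a trained subword tokenizer.
```python
from typing import Callable, List

def chars_per_token(corpus: List[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of characters covered by one token: higher means
    the tokenizer compresses the text more."""
    total_chars = sum(len(text) for text in corpus)
    total_tokens = sum(len(tokenize(text)) for text in corpus)
    return total_chars / total_tokens

# Stand-in tokenizer; a real study would plug in a trained subword tokenizer.
whitespace = lambda s: s.split()
print(chars_per_token(["language models compress text"], whitespace))
```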
arXiv Detail & Related papers (2024-03-10T17:02:53Z)
- Model Compression and Efficient Inference for Large Language Models: A Survey [20.199282252344396]
Large language models have two prominent characteristics compared to smaller models.
The most notable aspect of large models is the very high cost associated with model finetuning or training.
Large models emphasize versatility and generalization rather than performance on a single task.
arXiv Detail & Related papers (2024-02-15T06:58:30Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformers.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation [63.04361850630079]
Transformer models have achieved remarkable results in various natural language tasks, but they are often prohibitively large.
We propose LoSparse, a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix.
We show that it significantly outperforms existing compression methods.
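The decomposition idea can be sketched in a few lines of NumPy: a truncated SVD gives the low-rank part, and the largest-magnitude entries of the residual form the sparse part. Note that LoSparse itself optimizes both parts during training; this one-shot version and its rank/sparsity settings are only illustrative.
```python
import numpy as np

def low_rank_plus_sparse(w: np.ndarray, rank: int, keep: float):
    """Approximate w ~= U @ V + S, with U @ V of rank `rank` and S keeping
    only a `keep` fraction of the residual's largest-magnitude entries."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    U = u[:, :rank] * s[:rank]            # (m, rank) left factor, scaled by singular values
    V = vt[:rank]                         # (rank, n) right factor
    residual = w - U @ V
    k = int(residual.size * keep)
    thresh = np.partition(np.abs(residual).ravel(), -k)[-k] if k > 0 else np.inf
    S = np.where(np.abs(residual) >= thresh, residual, 0.0)
    return U, V, S

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
U, V, S = low_rank_plus_sparse(w, rank=16, keep=0.05)
err = np.linalg.norm(w - (U @ V + S)) / np.linalg.norm(w)
print(f"relative error: {err:.3f}, nonzeros in S: {np.count_nonzero(S)}")
```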
arXiv Detail & Related papers (2023-06-20T01:16:11Z)
- Revisiting Offline Compression: Going Beyond Factorization-based Methods for Transformer Language Models [7.542276054279341]
Transformer language models achieve outstanding results in many natural language processing (NLP) tasks.
Their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks.
In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model.
arXiv Detail & Related papers (2023-02-08T13:36:06Z)
- Intriguing Properties of Compression on Multilingual Models [17.06142742945346]
We propose a framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning.
Applying this framework to mBERT named entity recognition models across 40 languages, we find that compression confers several intriguing and previously unknown generalization properties.
arXiv Detail & Related papers (2022-11-04T20:28:01Z)
- What Do Compressed Multilingual Machine Translation Models Forget? [102.50127671423752]
We show that the performance on under-represented languages drops significantly, while the average BLEU score decreases only slightly.
We demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.
arXiv Detail & Related papers (2022-05-22T13:54:44Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
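As a heavily simplified illustration, the sketch below applies symmetric uniform quantization with a separate scale per module; the paper learns these scales dynamically during distillation, whereas here each scale is simply calibrated from the weights (an assumption for illustration), and the contrastive distillation step is omitted.
```python
import numpy as np

def quantize_module(w: np.ndarray, bits: int = 4):
    """Symmetric uniform quantization of one module's weights with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax      # per-module scale (calibrated, not learned)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
modules = {"attn.out_proj": rng.normal(size=(64, 64)),
           "ffn.up_proj": rng.normal(size=(64, 256))}
for name, w in modules.items():
    q, scale = quantize_module(w, bits=4)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"{name}: scale={scale:.4f}, mean abs error={err:.4f}")
```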
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques: knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
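The paper's exact regularizer is not reproduced here; the sketch below shows one plausible reading of the idea, weighting each sample's loss by the model's predictive entropy so that uncertain samples contribute more during compression-aware training. The function names and the weighting scheme are illustrative assumptions.
```python
import numpy as np

def entropy(p: np.ndarray, axis: int = -1) -> np.ndarray:
    """Shannon entropy of a probability distribution (natural log)."""
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=axis)

def uncertainty_weighted_loss(per_sample_loss: np.ndarray,
                              probs: np.ndarray) -> float:
    """Weight each sample's loss by its normalized predictive entropy,
    so uncertain samples are emphasized (an assumed formulation)."""
    h = entropy(probs)
    weights = h / h.sum()
    return float(np.sum(weights * per_sample_loss))

# Toy example: two samples, three classes.
probs = np.array([[0.98, 0.01, 0.01],    # confident sample
                  [0.40, 0.35, 0.25]])   # uncertain sample
losses = np.array([0.7, 0.7])
print(uncertainty_weighted_loss(losses, probs))
```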
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
- PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by PowerSGD for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
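A rough sketch of the core compression step: a single power-iteration pass producing a rank-1 approximation of the difference between two workers' parameters. The full algorithm reuses vectors across rounds and handles multiple neighbors, which is omitted here.
```python
import numpy as np

def rank1_power_compress(diff: np.ndarray, v: np.ndarray):
    """One power-iteration step: compress a parameter difference matrix
    into a rank-1 pair (p, q) such that diff ~= np.outer(p, q)."""
    p = diff @ v                          # (m,) left factor
    p_norm = np.linalg.norm(p) + 1e-12
    q = diff.T @ (p / p_norm)             # (n,) right factor
    return p / p_norm, q

rng = np.random.default_rng(0)
w_local = rng.normal(size=(128, 64))
w_neighbor = w_local + 0.01 * rng.normal(size=(128, 64))  # nearly identical models
diff = w_local - w_neighbor

v = rng.normal(size=64)        # warm-start vector (reused across rounds in practice)
p, q = rank1_power_compress(diff, v)
approx = np.outer(p, q)
# A single rank-1 step captures only part of the difference; repeated rounds
# with reused vectors transfer the rest over time.
print("relative error:", np.linalg.norm(diff - approx) / np.linalg.norm(diff))
```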
arXiv Detail & Related papers (2020-08-04T09:14:52Z)