LoSparse: Structured Compression of Large Language Models based on
Low-Rank and Sparse Approximation
- URL: http://arxiv.org/abs/2306.11222v2
- Date: Mon, 26 Jun 2023 15:34:57 GMT
- Title: LoSparse: Structured Compression of Large Language Models based on
Low-Rank and Sparse Approximation
- Authors: Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu
Chen, Tuo Zhao
- Abstract summary: Transformer models have achieved remarkable results in various natural language tasks, but they are often prohibitively large.
We propose LoSparse, a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix.
We show that it significantly outperforms existing compression methods.
- Score: 63.04361850630079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have achieved remarkable results in various natural
language tasks, but they are often prohibitively large, requiring massive
memories and computational resources. To reduce the size and complexity of
these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel
model compression technique that approximates a weight matrix by the sum of a
low-rank matrix and a sparse matrix. Our method combines the advantages of both
low-rank approximations and pruning, while avoiding their limitations. Low-rank
approximation compresses the coherent and expressive parts in neurons, while
pruning removes the incoherent and non-expressive parts in neurons. Pruning
enhances the diversity of low-rank approximations, and low-rank approximation
prevents pruning from losing too many expressive neurons. We evaluate our
method on natural language understanding, question answering, and natural
language generation tasks. We show that it significantly outperforms existing
compression methods.
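As a hedged illustration of the decomposition (not the paper's actual training procedure, which learns the factors jointly during compression), the sketch below forms the low-rank part from a truncated SVD of a given weight matrix and keeps only the largest-magnitude residual entries as the sparse part; the rank, keep ratio, and function name are illustrative assumptions.

```python
# Minimal sketch of a low-rank + sparse approximation W ~ L + S.
# Illustrative only: LoSparse learns these factors during training;
# here we simply construct them from a fixed matrix.
import numpy as np

def lowrank_plus_sparse(W, rank, keep_ratio=0.05):
    # Low-rank part L from a truncated SVD of W.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    # Sparse part S keeps only the largest-magnitude residual entries.
    R = W - L
    k = max(1, int(keep_ratio * R.size))
    thresh = np.partition(np.abs(R).ravel(), R.size - k)[R.size - k]
    S = np.where(np.abs(R) >= thresh, R, 0.0)
    return L, S

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
L, S = lowrank_plus_sparse(W, rank=32)
err = np.linalg.norm(W - (L + S)) / np.linalg.norm(W)
print(f"relative error: {err:.3f}, sparse nonzeros: {np.count_nonzero(S)}")
```

Relative to a truncated SVD alone, the sparse residual retains large individual entries that a purely low-rank factorization misses, which mirrors the complementarity between low-rank approximation and pruning described in the abstract.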
Related papers
- Differential error feedback for communication-efficient decentralized learning [48.924131251745266]
We propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback.
We show that the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate.
The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
arXiv Detail & Related papers (2024-06-26T15:11:26Z)
- Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [40.15915011575071]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models.
We conduct empirical research on the low-rank characteristics of large models.
We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- A Comprehensive Survey of Compression Algorithms for Language Models [10.21587168771851]
We survey and summarize diverse compression algorithms including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design.
We discuss the value of each category of compression algorithms, and the desired properties of low-cost compression algorithms which have a significant impact due to the emergence of large language models.
arXiv Detail & Related papers (2024-01-27T08:38:56Z)
- CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks [1.5199992713356987]
This paper introduces CompactifAI, an innovative compression approach using quantum-inspired tensor networks.
Our method is versatile and can be implemented with - or on top of - other compression techniques.
As a benchmark, we demonstrate that combining CompactifAI with quantization reduces the memory size of LlaMA 7B by 93%.
arXiv Detail & Related papers (2024-01-25T11:45:21Z)
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- What Do Compressed Multilingual Machine Translation Models Forget? [102.50127671423752]
We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases.
We demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.
arXiv Detail & Related papers (2022-05-22T13:54:44Z)
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
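As a rough sketch of the structure described in the last entry, the snippet below represents a full-vocabulary embedding table as a sparse transform over a small dense anchor table; the vocabulary size, anchor count, density, and random initialization are illustrative assumptions rather than the quantities ANT actually learns.

```python
# Sketch of the Anchor & Transform layout: a |V| x d embedding table
# factored as a sparse |V| x k transform T over k dense anchor rows A.
import numpy as np
import scipy.sparse as sp

vocab_size, dim, num_anchors = 100_000, 64, 500

anchors = np.random.randn(num_anchors, dim)              # A: k x d
# Each token mixes only a handful of anchors (learned in ANT, random here).
T = sp.random(vocab_size, num_anchors, density=5 / num_anchors, format="csr")

def embed(token_ids):
    # Embedding lookup: sparse rows of T times the dense anchor table.
    return T[token_ids] @ anchors                         # len(ids) x d

print(embed([3, 42, 99_999]).shape)  # (3, 64)
```

Storing the 500 x 64 anchor table plus roughly five nonzeros per token row is much smaller than a dense 100,000 x 64 table, which is why this factorization suits large vocabularies.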