Low-Rank Prune-And-Factorize for Language Model Compression
- URL: http://arxiv.org/abs/2306.14152v1
- Date: Sun, 25 Jun 2023 07:38:43 GMT
- Title: Low-Rank Prune-And-Factorize for Language Model Compression
- Authors: Siyu Ren, Kenny Q. Zhu
- Abstract summary: Matrix factorization fails to retain satisfactory performance under moderate to high compression rate.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
- Score: 18.088550230146247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The components underpinning PLMs -- large weight matrices -- were shown to
bear considerable redundancy. Matrix factorization, a well-established
technique from matrix theory, has been utilized to reduce the number of
parameters in PLM. However, it fails to retain satisfactory performance under
moderate to high compression rate. In this paper, we identify the
full-rankness of fine-tuned PLM as the fundamental bottleneck for the
failure of matrix factorization and explore the use of network pruning to
extract low-rank sparsity pattern desirable to matrix factorization. We find
such low-rank sparsity pattern exclusively exists in models generated by
first-order pruning, which motivates us to unite the two approaches and achieve
more effective model compression. We further propose two techniques:
sparsity-aware SVD and mixed-rank fine-tuning, which improve the initialization
and training of the compression procedure, respectively. Experiments on GLUE
and question-answering tasks show that the proposed method has superior
compression-performance trade-off compared to existing approaches.
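To make the abstract's central observation concrete, the following is a minimal sketch of plain truncated-SVD factorization. It is not the paper's sparsity-aware SVD or its mixed-rank fine-tuning. The function names `factorize` and `compression_ratio`, the matrix sizes, and the synthetic weights are all assumptions chosen for illustration. The sketch compares a nearly low-rank matrix with a dense full-rank one to show why full-rankness makes factorization lossy at the same rank budget.

```python
import numpy as np


def factorize(weight, rank):
    """Split `weight` (m x n) into A (m x rank) and B (rank x n) via truncated SVD.

    A @ B is the best rank-`rank` approximation of `weight` in Frobenius norm.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale the kept left singular vectors
    b = vt[:rank, :]
    return a, b


def compression_ratio(shape, rank):
    """Parameter count of the dense matrix divided by that of the two factors."""
    m, n = shape
    return (m * n) / (rank * (m + n))


def relative_error(weight, a, b):
    """Frobenius-norm reconstruction error relative to the norm of `weight`."""
    return np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)


rng = np.random.default_rng(0)
m, n, rank = 64, 48, 8  # illustrative sizes, not taken from the paper

# Hypothetical weight that is close to rank 8: a low-rank product plus small noise.
low_rank_weight = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
low_rank_weight += 0.01 * rng.standard_normal((m, n))

# Hypothetical dense weight with no low-rank structure (full rank almost surely).
full_rank_weight = rng.standard_normal((m, n))

a, b = factorize(low_rank_weight, rank)
low_rank_err = relative_error(low_rank_weight, a, b)

a_full, b_full = factorize(full_rank_weight, rank)
full_rank_err = relative_error(full_rank_weight, a_full, b_full)

print(f"compression ratio: {compression_ratio((m, n), rank):.2f}x")
print(f"relative error, near-low-rank weight: {low_rank_err:.4f}")
print(f"relative error, full-rank weight:     {full_rank_err:.4f}")
```

At the same rank budget, the near-low-rank matrix is reconstructed almost exactly while the full-rank one loses much of its energy, which is the kind of gap the abstract attributes to the full-rankness of fine-tuned weights.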
Related papers
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
- Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [40.15915011575071]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models.
We conduct empirical research on the low-rank characteristics of large models.
We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z)
- LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models [9.244526043014098]
Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memories and computational resources.
In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of Transformer exhibits noticeable low-rank structure.
We propose a mixed compression model, which organically combines Low-Rank matrix And structured Pruning (LoRAP).
arXiv Detail & Related papers (2024-04-15T11:53:22Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement Learning [53.445068584013896]
We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure.
In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP.
We show that simple spectral-based matrix estimation approaches efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error.
arXiv Detail & Related papers (2023-10-10T17:06:41Z)
- A Novel Maximum-Entropy-Driven Technique for Low-Rank Orthogonal Nonnegative Matrix Factorization with $\ell_0$-Norm Sparsity Constraint [0.0]
In data-driven control and machine learning, a common requirement involves breaking down large matrices into smaller, low-rank factors.
This paper introduces an innovative solution to the orthogonal nonnegative matrix factorization (ONMF) problem.
The proposed method achieves comparable or improved reconstruction errors in line with the literature.
arXiv Detail & Related papers (2022-10-06T04:30:59Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators [31.461762905053426]
We present a novel pre-trained language model (PLM) compression approach based on the matrix product operator (MPO) from quantum many-body physics.
Our approach can be applied to the original or the compressed PLMs in a general way, which derives a lighter network and significantly reduces the parameters to be fine-tuned.
arXiv Detail & Related papers (2021-06-04T01:50:15Z)
- Rank and run-time aware compression of NLP Applications [12.965657113072325]
This paper proposes a new compression technique called Hybrid Matrix Factorization.
It improves low-rank matrix factorization techniques by doubling the rank of the matrix.
It can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
arXiv Detail & Related papers (2020-10-06T16:03:15Z)
- Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)
- Multi-View Spectral Clustering Tailored Tensor Low-Rank Representation [105.33409035876691]
This paper explores the problem of multi-view spectral clustering (MVSC) based on tensor low-rank modeling.
We design a novel structured tensor low-rank norm tailored to MVSC.
We show that the proposed method outperforms state-of-the-art methods to a significant extent.
arXiv Detail & Related papers (2020-04-30T11:52:12Z)