Low-Rank Prune-And-Factorize for Language Model Compression
- URL: http://arxiv.org/abs/2306.14152v1
- Date: Sun, 25 Jun 2023 07:38:43 GMT
- Title: Low-Rank Prune-And-Factorize for Language Model Compression
- Authors: Siyu Ren, Kenny Q. Zhu
- Abstract summary: Matrix factorization fails to retain satisfactory performance under moderate to high compression rates.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
- Score: 18.088550230146247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The components underpinning PLMs -- large weight matrices -- were shown to
bear considerable redundancy. Matrix factorization, a well-established
technique from matrix theory, has been utilized to reduce the number of
parameters in PLMs. However, it fails to retain satisfactory performance under
moderate to high compression rates. In this paper, we identify the
\textit{full-rankness} of fine-tuned PLMs as the fundamental bottleneck behind
the failure of matrix factorization and explore the use of network pruning to
extract a low-rank sparsity pattern amenable to matrix factorization. We find
that such a low-rank sparsity pattern exists exclusively in models generated
by first-order pruning, which motivates us to unite the two approaches and
achieve more effective model compression. We further propose two techniques,
sparsity-aware SVD and mixed-rank fine-tuning, which improve the
initialization and training of the compression procedure, respectively.
Experiments on GLUE and question-answering tasks show that the proposed
method achieves a superior compression-performance trade-off compared to
existing approaches.
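For intuition, below is a minimal NumPy sketch of the prune-then-factorize mechanics. It is an illustration only, not the paper's sparsity-aware SVD or mixed-rank fine-tuning: magnitude pruning stands in for first-order pruning, and the rank is picked by a simple singular-value energy threshold; all sizes and thresholds are arbitrary example choices.

```python
# Illustrative prune-then-factorize sketch (NOT the paper's algorithm):
# magnitude pruning stands in for first-order pruning, and the rank is
# chosen by a crude singular-value energy threshold.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072))        # stand-in for a PLM weight matrix

# 1) Prune: keep the 20% largest-magnitude entries (80% sparsity).
k = int(0.2 * W.size)
threshold = np.partition(np.abs(W).ravel(), -k)[-k]
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

# 2) Factorize the pruned matrix with a truncated SVD.
U, S, Vt = np.linalg.svd(W_pruned, full_matrices=False)
energy = np.cumsum(S**2) / np.sum(S**2)
r = int(np.searchsorted(energy, 0.95)) + 1  # smallest rank keeping 95% energy

A = U[:, :r] * S[:r]                        # shape (768, r)
B = Vt[:r, :]                               # shape (r, 3072)

# 3) The dense product W @ x is replaced by the cheaper A @ (B @ x).
rel_err = np.linalg.norm(W_pruned - A @ B) / np.linalg.norm(W_pruned)
print(f"rank={r}, params {W.size} -> {A.size + B.size}, rel. error {rel_err:.3f}")
```

On a random matrix like this one the surviving sparsity pattern is not low-rank, so the selected rank stays high and little is saved; the paper's observation is that first-order pruning of fine-tuned PLMs does produce such a low-rank pattern, which is what makes the subsequent factorization pay off.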
Related papers
- Optimizing Singular Spectrum for Large Language Model Compression [95.7621116637755]
We introduce SoCo, a novel compression framework that learns to rescale the decomposed components of SVD in a data-driven manner.
Thanks to the learnable singular spectrum, SoCo adaptively prunes components according to the sparsified importance scores.
Experimental evaluations across multiple LLMs and benchmarks demonstrate that SoCo surpasses the state-of-the-art methods in model compression.
arXiv Detail & Related papers (2025-02-20T23:18:39Z) - Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP).
ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run.
We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z) - HASSLE-free: A unified Framework for Sparse plus Low-Rank Matrix Decomposition for LLMs [15.575498324678373]
A promising compression scheme is to decompose foundation models' dense weights into a sum of sparse plus low-rank matrices.
In this paper, we design a unified framework coined HASSLE-free for (semi-structured) sparse plus low-rank matrix decomposition (a generic sketch of this decomposition appears after this list).
arXiv Detail & Related papers (2025-02-02T20:23:32Z) - Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models [1.6385815610837167]
We present Pivoting Factorization (PIFA), a novel low-rank representation that unsupervisedly learns a compact form of any low-rank representation.
To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free low-rank reconstruction method.
MPIFA significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning.
arXiv Detail & Related papers (2025-01-31T12:36:31Z) - Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method.
We propose a higher-order Candecomp/Parafac (CP) decomposition, enabling a more compact and flexible representation.
Our method can achieve a reduction in the number of parameters while maintaining comparable performance.
arXiv Detail & Related papers (2024-10-05T06:59:50Z) - Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [40.15915011575071]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models.
We conduct empirical research on the low-rank characteristics of large models.
We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z) - LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models [9.244526043014098]
Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memories and computational resources.
In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of Transformer exhibits noticeable low-rank structure.
We propose a mixed compression model, which organically combines Low-Rank matrix approximation And structured Pruning (LoRAP).
arXiv Detail & Related papers (2024-04-15T11:53:22Z) - Data-freeWeight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z) - Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement
Learning [53.445068584013896]
We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure.
In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP.
We show that simple spectral-based matrix estimation approaches efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error.
arXiv Detail & Related papers (2023-10-10T17:06:41Z) - A Novel Maximum-Entropy-Driven Technique for Low-Rank Orthogonal
Nonnegative Matrix Factorization with $\ell_0$-Norm sparsity Constraint [0.0]
In data-driven control and machine learning, a common requirement involves breaking down large matrices into smaller, low-rank factors.
This paper introduces an innovative solution to the orthogonal nonnegative matrix factorization (ONMF) problem.
The proposed method achieves comparable or improved reconstruction errors in line with the literature.
arXiv Detail & Related papers (2022-10-06T04:30:59Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the \textit{homogeneous word embeddings}.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z) - Enabling Lightweight Fine-tuning for Pre-trained Language Model
Compression based on Matrix Product Operators [31.461762905053426]
We present a novel pre-trained language model (PLM) compression approach based on the matrix product operator (MPO for short) from quantum many-body physics.
Our approach can be applied to the original or the compressed PLMs in a general way, which derives a lighter network and significantly reduces the parameters to be fine-tuned.
arXiv Detail & Related papers (2021-06-04T01:50:15Z) - Rank and run-time aware compression of NLP Applications [12.965657113072325]
This paper proposes a new compression technique called Hybrid Matrix Factorization.
It improves low-rank matrix factorization techniques by doubling the rank of the matrix.
It can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
arXiv Detail & Related papers (2020-10-06T16:03:15Z) - Understanding Implicit Regularization in Over-Parameterized Single Index
Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z) - Multi-View Spectral Clustering Tailored Tensor Low-Rank Representation [105.33409035876691]
This paper explores the problem of multi-view spectral clustering (MVSC) based on tensor low-rank modeling.
We design a novel structured tensor low-rank norm tailored to MVSC.
We show that the proposed method outperforms state-of-the-art methods to a significant extent.
arXiv Detail & Related papers (2020-04-30T11:52:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.