LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot
Compression
- URL: http://arxiv.org/abs/2309.14021v1
- Date: Mon, 25 Sep 2023 10:35:17 GMT
- Title: LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot
Compression
- Authors: Ayush Kaushal, Tejas Vaidhya, Irina Rish
- Abstract summary: We study the potential to compress Large Language Models (LLMs) for monolingual code generation via Low Rank Decomposition (LoRD).
We then use LoRD to compress StarCoder 16B to 13.2B parameters with no drop, and to 12.3B with minimal drop, in HumanEval Pass@1 score, in less than 10 minutes on a single A100.
- Score: 16.901290551711476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low Rank Decomposition of a matrix - splitting a large matrix into
a product of two smaller matrices - offers a means of compression that reduces
the parameters of a model without sparsification, and hence delivers more
speedup on modern hardware. Moreover, unlike quantization, the compressed
linear layers remain fully differentiable and all their parameters trainable,
while still being able to leverage the existing highly efficient kernels over
floating-point matrices. We study the potential to compress Large Language
Models (LLMs) for monolingual code generation via Low Rank Decomposition
(LoRD) and observe that the ranks of the linear layers in these models can be
reduced by up to 39.58% with less than a 1% increase in perplexity. We then
use LoRD to compress StarCoder 16B to 13.2B parameters with no drop, and to
12.3B with minimal drop, in HumanEval Pass@1 score, in less than 10 minutes
on a single A100. The compressed models speed up inference by up to 22.35%
with just a single line of code changed over HuggingFace's implementation
with the PyTorch backend. LoRD models remain compatible with state-of-the-art
near-lossless quantization methods such as SpQR, which allows leveraging the
further compression gains of quantization. Lastly, QLoRA over a LoRD model
further reduces memory requirements by as much as 21.2% over vanilla QLoRA
while offering similar gains from parameter-efficient fine-tuning. Our work
shows Low Rank Decomposition (LoRD) to be a promising new paradigm for LLM
compression.
Related papers
- Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
arXiv Detail & Related papers (2024-11-04T04:58:20Z)
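To make the SNELL summary above concrete, here is a toy PyTorch sketch in which the tunable update is stored as two low-rank matrices and sparsity is imposed on their merged product at forward time, so no weight indexes need to be stored. The top-k magnitude rule is only a stand-in for the paper's competition-based sparsification, and all names and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn


class SparseLowRankAdapter(nn.Module):
    """Tunable update Delta = B @ A, sparsified on the fly (toy version)."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, density: float = 0.05):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.k = max(1, int(density * d_in * d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.B @ self.A  # dense merged update, (d_out, d_in)
        # Keep only the k largest-magnitude entries; mask is recomputed each
        # step, so no index tensor ever needs to be stored.
        threshold = delta.abs().flatten().kthvalue(delta.numel() - self.k + 1).values
        mask = (delta.abs() >= threshold).to(delta.dtype)
        return x @ (delta * mask).T
```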
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
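As a simplified illustration of the matrix-pair idea behind MoDeGPT: query and key projections interact only through Wq^T @ Wk, so that product can be re-factored with a smaller shared hidden dimension and no backward propagation. The SVD-based variant below is an assumption for illustration; the paper develops module-specific decompositions.

```python
import torch


def compress_qk_pair(Wq: torch.Tensor, Wk: torch.Tensor, new_dim: int):
    """Wq, Wk: (head_dim, d_model). Returns a pair with head_dim shrunk to new_dim."""
    product = Wq.T @ Wk  # (d_model, d_model) interaction the pair implements
    U, S, Vh = torch.linalg.svd(product, full_matrices=False)
    sqrt_S = torch.sqrt(S[:new_dim])
    Wq_new = (U[:, :new_dim] * sqrt_S).T           # (new_dim, d_model)
    Wk_new = sqrt_S.unsqueeze(1) * Vh[:new_dim]    # (new_dim, d_model)
    # (x @ Wq_new.T) @ (x @ Wk_new.T).T approximates the original attention logits.
    return Wq_new, Wk_new
```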
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of modern Large Language Models.
We present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning in a single framework.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
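A hedged sketch of the non-uniform idea in WeLore: give each weight matrix its own rank based on how quickly its spectrum decays, compressing low-rank-friendly layers and leaving the rest dense. WeLore derives its criterion from gradient behavior; the singular-value energy rule below is only a proxy.

```python
import torch


def rank_for_energy(W: torch.Tensor, energy: float = 0.95) -> int:
    """Smallest rank whose truncated SVD retains `energy` of the squared spectrum."""
    S = torch.linalg.svdvals(W)
    cumulative = torch.cumsum(S**2, dim=0) / (S**2).sum()
    return int(torch.searchsorted(cumulative, torch.tensor(energy)).item()) + 1


def assign_ranks(weights: dict[str, torch.Tensor]) -> dict[str, int]:
    # Layers that need nearly full rank are better left undecomposed.
    return {name: rank_for_energy(W) for name, W in weights.items()}
```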
- Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [40.15915011575071]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models.
We conduct empirical research on the low-rank characteristics of large models.
We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z)
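The entry above searches for per-layer low-rank configurations via Bayesian optimization. As a stand-in, the sketch below scores candidate rank allocations with a reconstruction-error proxy under a total-rank budget and keeps the best; a real system would use a BO library and a task metric, and every name here is hypothetical.

```python
import random
import torch


def reconstruction_error(W: torch.Tensor, rank: int) -> float:
    """Squared spectral energy discarded by a rank-`rank` truncation."""
    S = torch.linalg.svdvals(W)
    return float((S[rank:] ** 2).sum())


def random_search(weights: dict[str, torch.Tensor], budget: int, trials: int = 50):
    best, best_err = None, float("inf")
    for _ in range(trials):
        candidate = {n: random.randint(1, min(W.shape)) for n, W in weights.items()}
        if sum(candidate.values()) > budget:  # crude stand-in for a parameter budget
            continue
        err = sum(reconstruction_error(W, candidate[n]) for n, W in weights.items())
        if err < best_err:
            best, best_err = candidate, err
    return best
```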
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
arXiv Detail & Related papers (2023-11-20T18:57:41Z)
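The iterative algorithm described above can be sketched directly: alternate between taking a rank-r SVD of the residual W - Q and re-quantizing the residual W - L. The uniform fake-quantizer below is a placeholder for LQ-LoRA's memory-efficient quantized component.

```python
import torch


def fake_quantize(W: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform round-to-nearest quantization, returned in dequantized form."""
    scale = W.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(W / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale


def lq_decompose(W: torch.Tensor, rank: int, iters: int = 10):
    Q = torch.zeros_like(W)
    for _ in range(iters):
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        L = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]  # high-precision low-rank part
        Q = fake_quantize(W - L)                            # memory-efficient quantized part
    return Q, L  # W is approximated by Q + L; only L is finetuned
```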
- Training Acceleration of Low-Rank Decomposed Networks using Sequential Freezing and Rank Quantization [5.914653351242832]
We propose two techniques for accelerating low-rank decomposed models without requiring small ranks for the decomposition.
These methods include rank optimization and sequential freezing of layers.
Experiments show that, when combined, these techniques can improve model throughput by up to 60% during training and 37% during inference.
arXiv Detail & Related papers (2023-09-07T16:33:42Z)
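A toy sketch of sequential freezing as summarized above: every few epochs, permanently freeze the next block of the network so later epochs backpropagate through fewer layers. The block granularity and the schedule in the usage comment are invented for illustration.

```python
import torch.nn as nn


def freeze_next_block(model: nn.Sequential, num_frozen: int) -> int:
    """Freeze block `num_frozen` (counting from the input side); return new count."""
    if num_frozen < len(model):
        for p in model[num_frozen].parameters():
            p.requires_grad_(False)
        num_frozen += 1
    return num_frozen


# Usage inside a training loop (epochs_per_block is an assumed hyperparameter):
# if epoch > 0 and epoch % epochs_per_block == 0:
#     frozen = freeze_next_block(model, frozen)
```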
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU with no performance degradation and a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
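A rough sketch of the sparse-plus-quantized split behind SpQR: keep the largest-magnitude outlier weights in a high-precision sparse tensor and quantize everything else to a few bits. SpQR's actual format also quantizes the quantization parameters in small groups; none of that is reproduced in this toy, and the fraction and bit-width are assumptions.

```python
import torch


def sparse_quantize(W: torch.Tensor, outlier_frac: float = 0.01, bits: int = 3):
    k = max(1, int(outlier_frac * W.numel()))
    threshold = W.abs().flatten().kthvalue(W.numel() - k + 1).values
    outlier_mask = W.abs() >= threshold
    outliers = (W * outlier_mask).to_sparse()  # high-precision sparse part

    rest = W * ~outlier_mask
    scale = rest.abs().max() / (2 ** (bits - 1) - 1)
    quantized = torch.round(rest / scale).to(torch.int8)  # few-bit dense part
    return outliers, quantized, scale


# Reconstruction: W is approximately outliers.to_dense() + quantized.float() * scale
```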
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
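To illustrate what a hardware-efficient structured matrix can look like, here is a compact Monarch-style layer: two block-diagonal matrices separated by a fixed permutation (a reshape-transpose), giving O(n * sqrt(n)) parameters instead of n^2 for an n x n dense map. Initialization and exact structure follow the paper only loosely.

```python
import torch
import torch.nn as nn


class MonarchLinear(nn.Module):
    def __init__(self, n: int):
        super().__init__()
        m = int(n ** 0.5)
        assert m * m == n, "this sketch assumes n is a perfect square"
        self.m = m
        # Two sets of m blocks, each block m x m: 2 * n * sqrt(n) parameters total.
        self.blocks1 = nn.Parameter(torch.randn(m, m, m) / m ** 0.5)
        self.blocks2 = nn.Parameter(torch.randn(m, m, m) / m ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n)
        b, m = x.shape[0], self.m
        x = x.view(b, m, m)                                # split into m chunks
        x = torch.einsum("bkm,knm->bkn", x, self.blocks1)  # block-diagonal multiply
        x = x.transpose(1, 2)                              # fixed permutation
        x = torch.einsum("bkm,knm->bkn", x, self.blocks2)  # second block-diagonal
        return x.reshape(b, m * m)
```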