From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications
- URL: http://arxiv.org/abs/2407.11239v2
- Date: Sat, 07 Jun 2025 06:59:36 GMT
- Title: From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications
- Authors: Ajay Jaiswal, Yifan Wang, Lu Yin, Shiwei Liu, Runjin Chen, Jiawei Zhao, Ananth Grama, Yuandong Tian, Zhangyang Wang
- Abstract summary: We study the non-uniform low-rank properties of weight matrices in Large Language Models. We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning into one.
- Score: 85.17672240603011
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models' (LLMs) weight matrices can often be expressed in low-rank form with potential to relax memory and compute resource requirements. Unlike prior efforts that focus on developing novel matrix decompositions, in this work we study the non-uniform low-rank properties of weight matrices in LLMs through the lens of stabilizing gradient subspace. First, we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. Second, we empirically establish an important relationship between gradient dynamics and low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structures, necessitating variable rank reduction across them to minimize the drop in performance due to compression. Drawing on this result, we present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning into one, in a data-agnostic and one-shot manner. When used as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) and suitably encodes them for minimum performance loss. Our gradient dynamics perspective illustrates that LRCs tend to have better fine-tuning capabilities and their standalone fine-tuning can closely mimic and sometimes outperform the training loss trajectory and performance of full fine-tuning with notable memory and compute footprint reduction. Code is available at https://github.com/VITA-Group/WeLore.
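To make the LRC/N-LRC idea concrete, here is a minimal, illustrative sketch, not the released WeLore code: a matrix is kept in factored low-rank form only if its singular-value spectrum decays quickly enough. The 90% energy threshold and the rank cap below are assumed values chosen for demonstration.

```python
# Illustrative sketch, NOT the authors' implementation: keep a matrix in
# factored low-rank form only if its singular-value spectrum decays fast
# enough (an "LRC"); otherwise leave it dense (an "N-LRC"). The 90% energy
# threshold and the rank cap are assumptions for demonstration only.
import numpy as np

def truncate_if_low_rank(W, energy=0.90, max_rank_frac=0.25):
    """Return (A, B) with W ~= A @ B if W looks low-rank, else None."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    cum = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(cum, energy)) + 1      # smallest rank capturing `energy`
    if r > max_rank_frac * min(W.shape):           # spectrum too flat: treat as N-LRC
        return None
    return U[:, :r] * S[:r], Vt[:r, :]             # (m, r) and (r, n) factors

rng = np.random.default_rng(0)
low_rank_like = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 512))
full_rank_like = rng.normal(size=(512, 512))
print(truncate_if_low_rank(low_rank_like) is not None)   # True  -> compressed as two factors
print(truncate_if_low_rank(full_rank_like) is None)      # True  -> kept dense
```

Per the abstract, the actual LRC/N-LRC split is informed by the gradient-dynamics analysis rather than a fixed spectrum threshold, and in the memory-efficient fine-tuning setting it is the LRC factors that are trained standalone.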
Related papers
- QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR). We use a calibration dataset to measure both spatial and temporal complexity for each layer. We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z)
- Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition [4.119890956388359]
We introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that ODLRI consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.
arXiv Detail & Related papers (2025-06-02T09:15:13Z)
- HASSLE-free: A unified Framework for Sparse plus Low-Rank Matrix Decomposition for LLMs [15.575498324678373]
A promising compression scheme is to decompose foundation models' dense weights into a sum of sparse plus low-rank matrices.
In this paper, we design a unified framework coined HASSLE-free for (semi-structured) sparse plus low-rank matrix decomposition.
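As a rough illustration of the sparse-plus-low-rank objective, the toy alternating-projection sketch below splits a matrix as W ~= S + L; it is a generic illustration only and does not reproduce HASSLE-free's semi-structured (e.g. 2:4) sparsity patterns or its solver.

```python
# Toy sparse-plus-low-rank split W ~= S + L via alternating projections.
# Generic illustration only; HASSLE-free targets (semi-structured) sparsity
# patterns such as 2:4 and a different objective/solver.
import numpy as np

def sparse_plus_low_rank(W, rank=8, keep_frac=0.05, iters=10):
    S = np.zeros_like(W)
    for _ in range(iters):
        # L-step: best rank-`rank` approximation of the residual (truncated SVD)
        U, sig, Vt = np.linalg.svd(W - S, full_matrices=False)
        L = (U[:, :rank] * sig[:rank]) @ Vt[:rank, :]
        # S-step: keep only the largest-magnitude entries of the new residual
        R = W - L
        thresh = np.quantile(np.abs(R), 1.0 - keep_frac)
        S = np.where(np.abs(R) >= thresh, R, 0.0)
    return S, L

rng = np.random.default_rng(1)
W = rng.normal(size=(128, 8)) @ rng.normal(size=(8, 128))           # low-rank base
W += 5.0 * (rng.random(W.shape) < 0.02) * rng.normal(size=W.shape)  # a few large outliers
S, L = sparse_plus_low_rank(W)
print(np.linalg.norm(W - S - L) / np.linalg.norm(W))                # small residual error
```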
arXiv Detail & Related papers (2025-02-02T20:23:32Z)
- Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs [28.70239743254508]
We present the first structural binarization method for LLM compression to less than 1-bit precision.
We observe that some weights in binarized LLMs can be randomly flipped without significant performance degradation.
Our approach performs better than other compressed binarization methods while significantly reducing memory requirements.
arXiv Detail & Related papers (2024-08-03T15:07:44Z)
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients [86.40635601953446]
We introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection.
We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency.
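For context, a bare-bones version of the low-rank gradient projection that GaLore-style methods (including Q-GaLore) build on is sketched below; the INT4-quantized projections and layer-adaptive ranks that Q-GaLore adds are not shown, and all names and settings here are illustrative.

```python
# Bare-bones low-rank gradient projection (GaLore-style), for intuition only.
# Q-GaLore additionally keeps the projection matrices in INT4 and adapts the
# rank per layer; neither is reproduced here, and all names are illustrative.
import numpy as np

def galore_style_step(W, grad, P=None, rank=4, lr=0.1, refresh=False):
    """One SGD-like step where the gradient is projected into a rank-`rank` subspace."""
    if P is None or refresh:
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]                      # (m, r) projector, refreshed only occasionally
    g_low = P.T @ grad                       # (r, n): this is all the optimizer has to store
    W -= lr * (P @ g_low)                    # project the compact update back to full shape
    return W, P

rng = np.random.default_rng(2)
W = rng.normal(size=(64, 32))
target = np.ones_like(W)                     # toy objective: 0.5 * ||W - target||_F^2
print("error before:", round(np.linalg.norm(W - target), 1))
P = None
for step in range(200):
    grad = W - target                        # gradient of the toy objective
    W, P = galore_style_step(W, grad, P, refresh=(step % 10 == 0))
print("error after: ", round(np.linalg.norm(W - target), 1))   # noticeably smaller
```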
arXiv Detail & Related papers (2024-07-11T08:42:58Z)
- OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning [18.102930806071978]
Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore) is a memory-efficient fine-tuning approach.
OwLore consistently outperforms baseline approaches, including full fine-tuning.
arXiv Detail & Related papers (2024-05-28T17:22:22Z)
- Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [40.15915011575071]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models.
We conduct empirical research on the low-rank characteristics of large models.
We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z)
- LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models [9.244526043014098]
Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memories and computational resources.
In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of Transformer exhibits noticeable low-rank structure.
We propose a mixed compression model, which organically combines Low-Rank matrix And structured Pruning (LoRAP)
arXiv Detail & Related papers (2024-04-15T11:53:22Z)
- Data-free Weight Compress and Denoise for Large Language Models [96.68582094536032]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We achieve pruning of 80% of parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- LoTR: Low Tensor Rank Weight Adaptation [47.4904143988667]
We introduce LoTR, a novel approach for parameter-efficient fine-tuning of large language models (LLMs)
LoTR represents a gradient update to parameters in a form of tensor decomposition.
Simultaneous compression of a sequence of layers with a low-rank tensor representation allows LoTR to achieve even better parameter efficiency than LoRA, especially for deep models.
arXiv Detail & Related papers (2024-02-02T13:00:38Z)
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
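The alternating recipe described above can be sketched roughly as follows; this is a toy version in which a uniform quantizer, a fixed rank, and a fixed bit width stand in for the memory-efficient quantization scheme and settings actually used by LQ-LoRA.

```python
# Toy version of the iterative split described above: W ~= Q + L1 @ L2, with Q
# coarsely quantized and L1 @ L2 kept in high precision. The uniform quantizer
# and the rank/bit choices are illustrative stand-ins, not the paper's scheme.
import numpy as np

def uniform_quantize(W, bits=4):
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

def low_rank_plus_quant(W, rank=16, bits=4, iters=5):
    L = np.zeros_like(W)
    for _ in range(iters):
        Q = uniform_quantize(W - L, bits)                      # quantize what the factors miss
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)   # refit the factors on the residual
        L1, L2 = U[:, :rank] * S[:rank], Vt[:rank, :]
        L = L1 @ L2
    return Q, L1, L2

rng = np.random.default_rng(3)
W = rng.normal(size=(256, 256))
Q, L1, L2 = low_rank_plus_quant(W)
print(np.linalg.norm(W - Q - L1 @ L2) / np.linalg.norm(W))     # reconstruction error of the split
```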
arXiv Detail & Related papers (2023-11-20T18:57:41Z)
- LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression [16.901290551711476]
We study the potential to compress Large Language Models (LLMs) for monolingual code generation via Low Rank Decomposition (LoRD).
We then use LoRD to compress StarCoder 16B to 13.2B parameters with no drop, and to 12.3B with minimal drop, in HumanEval Pass@1 score, in less than 10 minutes on a single A100.
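Mechanically, this style of one-shot compression replaces each dense linear layer with two skinny factors obtained from a truncated SVD. The hedged sketch below uses a fixed rank, illustrative names, and random data rather than real StarCoder weights, so it only shows the parameter-count and matmul mechanics, not LoRD's rank-selection rule.

```python
# Sketch of one-shot low-rank layer compression: a dense (m, n) weight becomes
# two factors of rank r, so m*n parameters become r*(m + n). The fixed rank and
# the random matrix are illustrative; per the abstract above, real LLM weight
# matrices are approximated at low rank far better than Gaussian noise is.
import numpy as np

def factorize_linear(W, rank):
    """W (m, n) -> (W_down (r, n), W_up (m, r)) with W_up @ W_down ~= W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:rank, :], U[:, :rank] * S[:rank]

m, n, r = 1024, 1024, 256
W = np.random.default_rng(4).normal(size=(m, n))
W_down, W_up = factorize_linear(W, r)
x = np.random.default_rng(5).normal(size=n)
y_dense, y_factored = W @ x, W_up @ (W_down @ x)     # one matmul becomes two thinner ones
print("params:", m * n, "->", r * (m + n))           # 1,048,576 -> 524,288 for this shape
```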
arXiv Detail & Related papers (2023-09-25T10:35:17Z)
- LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaptation (LoRA) has emerged as a popular technique for fine-tuning large language models (LLMs).
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
- Spectral Tensor Train Parameterization of Deep Learning Layers [136.4761580842396]
We study low-rank parameterizations of weight matrices with embedded spectral properties in the Deep Learning context.
We show the effects of neural network compression in the classification setting, and both compression and improved training stability in the generative adversarial training setting.
arXiv Detail & Related papers (2021-03-07T00:15:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.