Layer-wise dynamic rank for compressing large language models
- URL: http://arxiv.org/abs/2509.25622v2
- Date: Sat, 04 Oct 2025 02:57:59 GMT
- Title: Layer-wise dynamic rank for compressing large language models
- Authors: Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang,
- Abstract summary: Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges. We propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLMs compression.
- Score: 2.9416461160070955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges that hinder their deployment. Singular Value Decomposition (SVD)-based compression has emerged as an appealing post-training compression technique for LLMs, yet most existing methods apply a uniform compression ratio across all layers, implicitly assuming homogeneous information included in various layers. This overlooks the substantial intra-layer heterogeneity observed in LLMs, where middle layers tend to encode richer information while early and late layers are more redundant. In this work, we revisit the existing SVD-based compression method and propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLMs compression. We first introduce effective rank as a principled metric to measure the information density of weight matrices, and then allocate ranks via a Lagrange multiplier-based optimization scheme to adaptively assign more capacity to groups with higher information density under a fixed compression ratio. Moreover, we rebalance the allocated ranks across attention layers to account for their varying importance and extend D-Rank to latest LLMs with grouped-query attention. Extensive experiments on various LLMs with different scales across multiple compression ratios demonstrate that D-Rank consistently outperforms SVD-LLM, ASVD, and Basis Sharing, achieving more than 15 lower perplexity with LLaMA-3-8B model on C4 datasets at 20% compression ratio and up to 5% higher zero-shot reasoning accuracy with LLaMA-7B model at 40% compression ratio while achieving even higher throughput.
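The abstract names two ingredients: effective rank as a measure of information density, and rank allocation under a fixed compression budget. The sketch below illustrates both under stated assumptions. It uses the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values), which may differ in detail from the paper's. The budget split is a simple proportional rule standing in for the paper's Lagrange multiplier-based scheme, which the abstract does not spell out. The names `effective_rank`, `allocate_ranks`, and `keep_ratio` are hypothetical.

```python
import numpy as np


def effective_rank(weight, eps=1e-12):
    """Entropy-based effective rank of a 2-D weight matrix (assumed definition)."""
    s = np.linalg.svd(weight, compute_uv=False)
    p = s / (s.sum() + eps)          # normalize singular values to a distribution
    p = p[p > eps]                   # drop numerically zero entries before the log
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def allocate_ranks(weights, keep_ratio):
    """Hypothetical proportional allocation, not the paper's optimizer.

    Each matrix of shape (m, n) kept at rank r costs r * (m + n) parameters.
    Ranks are set proportional to effective rank so that the total cost is
    about keep_ratio times the original parameter count.
    """
    eff = np.array([effective_rank(w) for w in weights])
    cost_per_rank = np.array([w.shape[0] + w.shape[1] for w in weights])
    budget = keep_ratio * sum(w.size for w in weights)
    scale = budget / (cost_per_rank * eff).sum()
    max_rank = np.array([min(w.shape) for w in weights])
    # Clipping to at least rank 1 can push the total slightly over budget.
    return np.clip(np.floor(scale * eff), 1, max_rank).astype(int)


if __name__ == "__main__":
    dense_info = np.eye(8)           # flat spectrum: high effective rank
    redundant = np.ones((8, 8))      # rank-1 matrix: low effective rank
    print(effective_rank(dense_info), effective_rank(redundant))
    print(allocate_ranks([dense_info, redundant], keep_ratio=0.5))
```

In this toy case the matrix with the flat spectrum receives the larger rank, which mirrors the abstract's idea of giving more capacity to groups with higher information density.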
Related papers
- Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts. We first show that existing prompt compression methods are ineffective for many-shot compression. We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z) - FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference [7.784124271824854]
Large language models (LLMs) have enormous parameter counts that hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer.
arXiv Detail & Related papers (2025-10-10T12:35:09Z) - Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM [11.762499172999886]
Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge.
arXiv Detail & Related papers (2025-10-07T03:07:47Z) - MGAA: Multi-Granular Adaptive Allocation for Low-Rank Compression of LLMs [9.244526043014098]
The Multi-Granular Adaptive Allocation (MGAA) method can adaptively allocate parameters between and within sublayers without task-specific evaluations in the compression process. Comprehensive evaluations of MGAA across multiple LLM backbone models and benchmark datasets demonstrate its superior performance.
arXiv Detail & Related papers (2025-07-04T04:54:01Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression [5.206085750261924]
Large Language Models (LLMs) require a significant amount of memory storage in inference.
In this paper, we take a step further to explore parameter sharing across different layers with singular value decomposition.
Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches.
arXiv Detail & Related papers (2024-10-02T14:30:02Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications [85.17672240603011]
We study the non-uniform low-rank properties of weight matrices in Large Language Models. We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning into one.
arXiv Detail & Related papers (2024-07-15T21:05:20Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z) - Adaptive Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [42.53133823994923]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models. We conduct empirical research on the low-rank characteristics of large models. We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z) - Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
Knowledge-Intensive Compressed LLM BenchmarK aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)