Related papers: Layer-wise dynamic rank for compressing large language models

Layer-wise dynamic rank for compressing large language models

URL: http://arxiv.org/abs/2509.25622v2
Date: Sat, 04 Oct 2025 02:57:59 GMT
Title: Layer-wise dynamic rank for compressing large language models
Authors: Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang,
Abstract summary: Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges.<n>We propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLMs compression.
Score: 2.9416461160070955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges that hinder their deployment. Singular Value Decomposition (SVD)-based compression has emerged as an appealing post-training compression technique for LLMs, yet most existing methods apply a uniform compression ratio across all layers, implicitly assuming homogeneous information included in various layers. This overlooks the substantial intra-layer heterogeneity observed in LLMs, where middle layers tend to encode richer information while early and late layers are more redundant. In this work, we revisit the existing SVD-based compression method and propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLMs compression. We first introduce effective rank as a principled metric to measure the information density of weight matrices, and then allocate ranks via a Lagrange multiplier-based optimization scheme to adaptively assign more capacity to groups with higher information density under a fixed compression ratio. Moreover, we rebalance the allocated ranks across attention layers to account for their varying importance and extend D-Rank to latest LLMs with grouped-query attention. Extensive experiments on various LLMs with different scales across multiple compression ratios demonstrate that D-Rank consistently outperforms SVD-LLM, ASVD, and Basis Sharing, achieving more than 15 lower perplexity with LLaMA-3-8B model on C4 datasets at 20% compression ratio and up to 5% higher zero-shot reasoning accuracy with LLaMA-7B model at 40% compression ratio while achieving even higher throughput.

Related papers

Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts.<n>We first show that existing prompt compression methods are ineffective for many-shot compression.<n>We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z)
FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference [7.784124271824854]
Large language models (LLM) have enormous counts hinder deployment on resource-constrained hardware.<n>Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation.<n>To address these issues, we propose the Fine-grained Low-Rank parameter (FLRC) which efficiently determines an optimal rank allocation for each layer.
arXiv Detail & Related papers (2025-10-10T12:35:09Z)
Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM [11.762499172999886]
Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment.<n>We present a novel low-rank compression framework to address this challenge.
arXiv Detail & Related papers (2025-10-07T03:07:47Z)
MGAA: Multi-Granular Adaptive Allocation fof Low-Rank Compression of LLMs [9.244526043014098]
Multi-Granular Adaptive Allocation (MGAA) method can adaptively allocate parameters between and within sublayers without task-specific evaluations in the compression process.<n> Comprehensive evaluations of MGAA across multiple LLMs backbone models and benchmark datasets demonstrate its superior performance.
arXiv Detail & Related papers (2025-07-04T04:54:01Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression [5.206085750261924]
Large Language Models (LLMs) require significant amount of memory storage in inference. In this paper, we take a step further to explore parameter sharing across different layers with singular value decomposition. Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches.
arXiv Detail & Related papers (2024-10-02T14:30:02Z)
Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.<n>We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.<n>Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications [85.17672240603011]
We study the non-uniform low-rank properties of weight matrices in Large Language Models.<n>We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning into one.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
Adaptive Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [42.53133823994923]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models.<n>We conduct empirical research on the low-rank characteristics of large models.<n>We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z)
Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
Knowledge-Intensive Compressed LLM BenchmarK aims to redefine the evaluation protocol for compressed Large Language Models. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods. LLM-KICK is designed to holistically access compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.