Hierarchical Sparse Plus Low Rank Compression of LLM
- URL: http://arxiv.org/abs/2601.07839v1
- Date: Fri, 19 Dec 2025 04:28:30 GMT
- Title: Hierarchical Sparse Plus Low Rank Compression of LLM
- Authors: Pawan Kumar, Aditi Gupta
- Abstract summary: We present Hierarchical Sparse Plus Low-Rank (HSS) compression, a two-stage scheme that extracts the largest-magnitude weights into a sparse matrix S. HSS is hardware-friendly: its matrix-vector multiply reduces to one sparse product and a sequence of thin-matrix multiplications. Experiments on LLaMA-7B show that targeting only the self-attention projections suffices to yield large memory savings.
- Score: 2.4311207322523023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language models (LLMs) place extraordinary pressure on memory and compute budgets, making principled compression indispensable for both deployment and continued training. We present Hierarchical Sparse Plus Low-Rank (HSS) compression, a two-stage scheme that (i) extracts the largest-magnitude weights into a sparse matrix S and (ii) applies a recursive Hierarchically Sparse Separable (HSS) low-rank factorisation to the dense residual matrix. A recursive rank-reducing strategy and a reverse Cuthill-McKee (RCM) permutation are introduced to concentrate large weights near the diagonal, aligning them with the block-diagonal hierarchy and maximising the compressibility of the off-diagonal blocks (which are touched only once in the recursion). HSS is hardware-friendly: its matrix-vector multiply reduces to one sparse product and a sequence of thin-matrix multiplications, and it can be trained end-to-end with standard optimisers. Experiments on LLaMA-7B show that targeting only the self-attention projections (the 1.6B parameters of the Q, K, and V matrices out of 7B total) suffices to yield large memory savings while retaining comparable state-of-the-art perplexity on test samples of the WikiText dataset. For example, with a 30% sparsity budget and an outer rank of 512, sHSS-RCM achieves a perplexity of 1.64, outperforming dense baselines and classical sparse-plus-SVD variants while also achieving significant memory savings.
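As a rough illustration of the two-stage scheme (minus the recursion and the RCM reordering), a minimal NumPy/SciPy sketch might look as follows; the function names and the single-level SVD are simplifications of my own, not the paper's implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_plus_low_rank(W, sparsity=0.3, rank=512):
    """One-level sketch of the split: pull the largest-magnitude entries
    into a sparse S, then factorise the dense residual with a truncated
    SVD. The paper's recursive HSS hierarchy and RCM permutation are
    omitted; this only illustrates the two-stage idea."""
    k = int(sparsity * W.size)                        # sparsity budget
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    S = csr_matrix(np.where(np.abs(W) >= thresh, W, 0.0))
    U, s, Vt = np.linalg.svd(W - S.toarray(), full_matrices=False)
    return S, U[:, :rank] * s[:rank], Vt[:rank, :]    # S, thin U, thin V

def matvec(S, U_r, V_r, x):
    # One sparse product plus two thin dense products, as in the abstract.
    return S @ x + U_r @ (V_r @ x)
```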
Related papers
- Zero Sum SVD: Balancing Loss Sensitivity for Low Rank LLM Compression [11.908793753919745]
We propose Zero Sum SVD (ZS-SVD), a post-training method that performs singular component selection in whitened coordinates. ZS-SVD prunes components across the whole model with a zero sum rule that keeps the cumulative predicted loss change near zero. Experiments show consistent gains across diverse benchmarks and compression ratios.
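Reading only this summary, the zero-sum rule suggests a simple greedy caricature: keep pruning whichever component best cancels the running predicted loss change. A hypothetical sketch, not ZS-SVD itself:

```python
import numpy as np

def zero_sum_prune(delta_loss, budget):
    """Greedily pick `budget` components to prune while keeping the
    cumulative predicted loss change near zero. A caricature of the
    zero-sum rule, sketched from the summary alone."""
    pruned, total = [], 0.0
    remaining = set(range(len(delta_loss)))
    for _ in range(budget):
        # pick the component whose removal keeps |running total| smallest
        best = min(remaining, key=lambda i: abs(total + delta_loss[i]))
        pruned.append(best)
        total += delta_loss[best]
        remaining.remove(best)
    return pruned, total
```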
arXiv Detail & Related papers (2026-02-02T21:51:01Z) - Concatenated Matrix SVD: Compression Bounds, Incremental Approximation, and Error-Constrained Clustering [0.0]
We propose three clustering algorithms that merge matrices only when their predicted joint SVD compression error remains below a user-specified threshold. The algorithms span a trade-off between speed, provable accuracy, and scalability, enabling compression-aware clustering with explicit error control.
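The merge test described here can be written down directly: two matrices join a cluster only if the rank-constrained SVD error of their concatenation stays under a threshold. A minimal sketch, assuming column-wise concatenation and relative Frobenius error:

```python
import numpy as np

def joint_svd_error(matrices, rank):
    """Relative Frobenius error of a rank-`rank` SVD of the column-wise
    concatenation of `matrices` (all must share a row dimension)."""
    C = np.concatenate(matrices, axis=1)
    s = np.linalg.svd(C, compute_uv=False)
    return np.sqrt((s[rank:] ** 2).sum()) / np.linalg.norm(C)

def should_merge(A, B, rank, tol=0.05):
    # Merge only if the predicted joint compression error stays below
    # the user-specified threshold.
    return joint_svd_error([A, B], rank) <= tol
```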
arXiv Detail & Related papers (2026-01-12T18:15:53Z) - COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning [5.595343998068235]
CoSpaDi is a training-free compression framework that replaces low-rank decomposition with a more flexible structured sparse factorization. We evaluate CoSpaDi across multiple Llama and Qwen models under per-layer and per-group settings at 20-50% compression ratios.
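A structured sparse factorization of the kind this summary describes can be sketched with plain alternating least squares plus hard thresholding; this generic stand-in ignores CoSpaDi's calibration data and exact objective:

```python
import numpy as np

def sparse_dict_factorize(W, k=64, nnz=8, iters=20, seed=0):
    """Factorize W (m x n) as D @ C with a dense dictionary D (m x k) and
    a column-sparse coefficient matrix C (k x n). A generic alternating
    scheme, not CoSpaDi's calibration-guided method."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((W.shape[0], k))
    for _ in range(iters):
        C, *_ = np.linalg.lstsq(D, W, rcond=None)       # fit coefficients
        idx = np.argsort(np.abs(C), axis=0)[:-nnz, :]   # keep top-nnz per col
        np.put_along_axis(C, idx, 0.0, axis=0)
        D, *_ = np.linalg.lstsq(C.T, W.T, rcond=None)   # refit dictionary
        D = D.T
    return D, C
```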
arXiv Detail & Related papers (2025-09-26T08:55:09Z) - FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models [49.397861654088636]
We propose a two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces. We show that our strategy achieves faster runtime and reduced memory usage by up to 25% across different model sizes.
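The summary does not spell out the procedure, but an FFT-based subspace can be illustrated generically: project gradient columns onto their highest-energy DFT frequencies instead of computing an SVD/QR basis. A guess at the flavour of the method, not its actual two-step algorithm:

```python
import numpy as np

def fft_project(G, k):
    """Keep the k highest-energy DFT frequency rows of gradient matrix G
    as a cheap fixed subspace. Illustrative only."""
    F = np.fft.rfft(G, axis=0)
    keep = np.argsort(np.abs(F).sum(axis=1))[-k:]   # top-k frequency rows
    return keep, F[keep]                            # compressed state

def fft_unproject(keep, coeffs, n):
    # Map the compressed optimizer state back to the full space.
    F = np.zeros((n // 2 + 1, coeffs.shape[1]), dtype=complex)
    F[keep] = coeffs
    return np.fft.irfft(F, n=n, axis=0)
```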
arXiv Detail & Related papers (2025-05-23T14:37:00Z) - BOLT: Block-Orthonormal Lanczos for Trace estimation of matrix functions [2.4578723416255754]
In many large-scale applications, the matrices involved are too large to store or access in full, making a single mat-vec product infeasible. We introduce Subblock SLQ, a variant of BOLT that operates only on small principal submatrices. We provide theoretical guarantees and demonstrate strong empirical performance across a range of high-dimensional settings.
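Trace estimation from mat-vec products alone is the setting here; the classic Hutchinson estimator below is the simplest member of that family and the starting point that SLQ-style methods such as BOLT build on (the Lanczos machinery for general tr(f(A)) is omitted):

```python
import numpy as np

def hutchinson_trace(matvec, n, num_probes=64, seed=0):
    """Estimate tr(A) using only mat-vec products with Rademacher probes.
    Shown for f(A) = A only, as a sketch of the general approach."""
    rng = np.random.default_rng(seed)
    est = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        est += z @ matvec(z)
    return est / num_probes

# Usage sketch: estimate the trace of a matrix we can only multiply by.
A = np.random.default_rng(1).standard_normal((500, 500))
A = A @ A.T                                   # symmetric PSD test matrix
print(hutchinson_trace(lambda v: A @ v, 500), np.trace(A))
```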
arXiv Detail & Related papers (2025-05-18T08:04:05Z) - Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
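Putting these three sentences together, a plausible miniature of the mechanism is: store two low-rank factors, merge them, and let a magnitude competition decide which entries of the merged update survive, so no index set ever needs to be stored. A sketch of that reading, not SNELL's kernelized formulation:

```python
import numpy as np

def competitive_sparse_delta(A, B, density=0.05):
    """Form a tunable update as a low-rank product, then keep only the
    entries that win a magnitude competition; the mask is recomputed
    from the merged matrix, so no indexes are stored. Illustrative."""
    M = A @ B                                    # merge two low-rank factors
    k = max(1, int(density * M.size))
    thresh = np.partition(np.abs(M).ravel(), -k)[-k]
    return np.where(np.abs(M) >= thresh, M, 0.0)

# Usage sketch: W_tuned = W + competitive_sparse_delta(A, B).
```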
arXiv Detail & Related papers (2024-11-04T04:58:20Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
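The stated idea, a low-rank approximation of the KV weight matrices that plugs into an existing model, can be sketched directly with a truncated SVD; LoRC's progressive, layer-sensitive rank allocation is omitted:

```python
import numpy as np

def low_rank_kv(W_k, W_v, rank):
    """Replace K and V projection weights with truncated-SVD factors so
    cached keys/values live in a rank-`rank` subspace. A plug-in sketch
    of the general idea only."""
    def factor(W):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return U[:, :rank] * s[:rank], Vt[:rank, :]   # W ≈ A @ B
    return factor(W_k), factor(W_v)
```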
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications [85.17672240603011]
We study the non-uniform low-rank properties of weight matrices in Large Language Models. We present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning into one framework.
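The non-uniform observation invites an obvious per-matrix rule: give each weight matrix only as much rank as its singular-value decay demands. A stand-in criterion of my own, not WeLore's exact test:

```python
import numpy as np

def adaptive_rank(W, energy=0.95):
    """Smallest rank capturing `energy` of the squared singular values:
    matrices with fast spectral decay compress hard, others stay near
    full rank. Illustrative criterion only."""
    s = np.linalg.svd(W, compute_uv=False)
    cum = np.cumsum(s ** 2) / (s ** 2).sum()
    return int(np.searchsorted(cum, energy) + 1)
```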
arXiv Detail & Related papers (2024-07-15T21:05:20Z) - LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
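The iterative decomposition reads naturally as alternating projections: quantize the current residual, then refit the low-rank term against what quantization left behind. A minimal sketch with a naive uniform quantizer standing in for LQ-LoRA's memory-efficient one:

```python
import numpy as np

def lq_decompose(W, rank=64, bits=4, iters=10):
    """Alternate between quantizing the residual and refitting a low-rank
    term, so that W ≈ Q + L1 @ L2. The iteration pattern is the point;
    the quantizer here is a deliberately simple stand-in."""
    L1 = np.zeros((W.shape[0], rank))
    L2 = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        R = W - L1 @ L2
        scale = np.abs(R).max() / (2 ** (bits - 1) - 1)   # quantize residual
        Q = np.round(R / scale).clip(-2 ** (bits - 1),
                                     2 ** (bits - 1) - 1) * scale
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)  # refit low rank
        L1, L2 = U[:, :rank] * s[:rank], Vt[:rank, :]
    return Q, L1, L2
```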
arXiv Detail & Related papers (2023-11-20T18:57:41Z) - Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance under moderate to high compression rates.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
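One plausible reading of "sparsity-aware SVD" is that the factorization should respect a pruning decision rather than fight it. A plain guess sketched from the summary; the paper's exact weighting and its mixed-rank fine-tuning are not reproduced:

```python
import numpy as np

def prune_then_factorize(W, sparsity=0.5, rank=64):
    """Magnitude-prune W, then take a truncated SVD of the pruned matrix,
    so the factorization spends no rank on weights pruning already judged
    unimportant. Illustrative stand-in only."""
    k = int((1 - sparsity) * W.size)                  # entries to keep
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    W_pruned = np.where(np.abs(W) >= thresh, W, 0.0)
    U, s, Vt = np.linalg.svd(W_pruned, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]
```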
arXiv Detail & Related papers (2023-06-25T07:38:43Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers and achieved performance comparable to pure data-driven networks while using far fewer parameters.
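Algorithm unfolding itself is easy to show in miniature: fix the number of iterations, and give each iteration its own trainable parameters. The toy below unrolls gradient descent on a least-squares objective; the paper's actual per-layer computation comes from its GDPA-linearized SDP solver, which is not reproduced here:

```python
import numpy as np

def unrolled_solver(x, layers):
    """Algorithm unfolding in miniature: each iteration of an iterative
    solver becomes one 'layer' with its own learnable parameter, here a
    per-layer step size for gradient descent on ||Ax - b||^2."""
    for alpha, A, b in layers:            # one (trainable) layer per step
        x = x - alpha * (A.T @ (A @ x - b))
    return x
```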
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing users' dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve compression rates of 4 to 8 times on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)