COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
- URL: http://arxiv.org/abs/2509.22075v2
- Date: Mon, 06 Oct 2025 12:56:01 GMT
- Title: COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
- Authors: Dmitriy Shopkhoev, Denis Makhov, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis,
- Abstract summary: CoSpaDi is a training-free compression framework that replaces low-rank decomposition with a more flexible structured sparse factorization.<n>We evaluate CoSpaDi across multiple Llama and Qwen models under per-layer and per-group settings at 20-50% compression ratios.
- Score: 5.595343998068235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training compression of large language models (LLMs) largely relies on low-rank weight approximation, which represents each column of a weight matrix in a shared low-dimensional subspace. While this is a computationally efficient strategy, the imposed structural constraint is rigid and can lead to a noticeable model accuracy drop. In this work, we propose CoSpaDi (Compression via Sparse Dictionary Learning), a novel training-free compression framework that replaces low-rank decomposition with a more flexible structured sparse factorization in which each weight matrix is represented with a dense dictionary and a column-sparse coefficient matrix. This formulation enables a union-of-subspaces representation: different columns of the original weight matrix are approximated in distinct subspaces spanned by adaptively selected dictionary atoms, offering greater expressiveness than a single invariant basis. Crucially, CoSpaDi leverages a small calibration dataset to optimize the factorization such that the output activations of compressed projection layers closely match those of the original ones, thereby minimizing functional reconstruction error rather than mere weight approximation. This data-aware strategy preserves better model fidelity without any fine-tuning under reasonable compression ratios. Moreover, the resulting structured sparsity allows efficient sparse-dense matrix multiplication and is compatible with post-training quantization for further memory and latency gains. We evaluate CoSpaDi across multiple Llama and Qwen models under per-layer and per-group settings at 20-50\% compression ratios, demonstrating consistent superiority over state-of-the-art data-aware low-rank methods both in accuracy and perplexity. Our results establish structured sparse dictionary learning as a powerful alternative to conventional low-rank approaches for efficient LLM deployment.
Related papers
- COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression [5.280540253822294]
Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD)<n>We propose COMPOT, a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization.<n> COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines.
arXiv Detail & Related papers (2026-02-16T21:31:34Z) - WUSH: Near-Optimal Adaptive Transforms for LLM Quantization [52.77441224845925]
Quantization to low bitwidth is a standard approach for deploying large language models.<n>A few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer.<n>We derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization.
arXiv Detail & Related papers (2025-11-30T16:17:34Z) - Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook.<n>Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z) - MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network [8.780976521229741]
Kolmogorov-Arnold Networks (KANs) replace scalar weights with per-edge vectors of basis coefficients.<n>We propose MetaCluster, a framework that makes KANs highly compressible without sacrificing accuracy.
arXiv Detail & Related papers (2025-10-21T21:58:15Z) - Optimizing Singular Spectrum for Large Language Model Compression [95.7621116637755]
We introduce SoCo, a novel compression framework that learns to rescale the decomposed components of SVD in a data-driven manner.<n>Thanks to the learnable singular spectrum, SoCo adaptively prunes components according to the sparsified importance scores.<n> Experimental evaluations across multiple LLMs and benchmarks demonstrate that SoCo surpasses the state-of-the-art methods in model compression.
arXiv Detail & Related papers (2025-02-20T23:18:39Z) - Choose Your Model Size: Any Compression of Large Language Models Without Re-Computation [10.376875638696504]
This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off.<n>We use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty.<n>We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z) - CURing Large Models: Compression via CUR Decomposition [1.1510009152620668]
We introduce CURing, a novel model compression method based on CUR matrix decomposition.<n>By identifying and retaining informative rows and columns, CURing significantly reduces model size with minimal performance loss.<n>For example, it reduces Llama3.1-8B's parameters to 7.32B (-9%) in just 129 seconds, over 20 times faster than prior compression methods.
arXiv Detail & Related papers (2025-01-08T01:11:17Z) - Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
arXiv Detail & Related papers (2024-11-04T04:58:20Z) - Data-free Weight Compress and Denoise for Large Language Models [96.68582094536032]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.<n>We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z) - Optimal Variable Clustering for High-Dimensional Matrix Valued Data [3.1138411427556445]
We propose a new latent variable model for the features arranged in matrix form.
Under mild conditions, our algorithm attains clustering consistency in the high-dimensional setting.
We identify the optimal weight in the sense that using this weight guarantees our algorithm to be minimax rate-optimal.
arXiv Detail & Related papers (2021-12-24T02:13:04Z) - Dictionary-based Low-Rank Approximations and the Mixed Sparse Coding
problem [7.132368785057316]
I show how to adapt an efficient MSC solver based on the LASSO to compute Dictionary-based Matrix Factorization and Canonical Polyadic Decomposition.
I show how to adapt an efficient MSC solver based on the LASSO to compute Dictionary-based Matrix Factorization and Canonical Polyadic Decomposition in the context of hyperspectral image processing and chemometrics.
arXiv Detail & Related papers (2021-11-24T10:32:48Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - Dynamic Probabilistic Pruning: A general framework for
hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps)
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP)
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.