Enhancing Delta Compression in LLMs via SVD-based Quantization Error Minimization
- URL: http://arxiv.org/abs/2506.11087v2
- Date: Sat, 27 Sep 2025 06:06:19 GMT
- Title: Enhancing Delta Compression in LLMs via SVD-based Quantization Error Minimization
- Authors: Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen
- Abstract summary: We introduce DeltaMix, an adaptive mixed-precision delta-compression framework to minimize quantization error in singular value decomposition space. DeltaMix consistently outperforms all baseline methods on tasks such as AIME2024 and GQA.
- Score: 18.037215883638535
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, like multi-tenant serving, a large number of LLMs fine-tuned from the same base model are deployed to meet complex requirements for users. Recent works explore delta-compression approaches to quantize and compress the delta weights between the customized LLM and the corresponding base model. However, they exhibit inadequate performance at high compression ratios due to their empirical nature. In this work, we introduce DeltaMix, an adaptive mixed-precision delta-compression framework designed to minimize quantization error in the singular value decomposition (SVD) space without imposing additional assumptions. DeltaMix provides a theoretical justification for the necessity of mixed-precision compression and presents a practical quantization solution that involves solving a 0/1 linear integer programming problem alongside a reconstruction target correction method. Experimental results across multiple models and benchmarks illustrate that DeltaMix consistently outperforms all baseline methods. Notably, on tasks such as AIME2024 and GQA, DeltaMix exceeds the performance of the best baseline, Delta-CoMe, by 22.3% and 6.1% for 7B parameter models, respectively.
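The 0/1 integer program mentioned in the abstract lends itself to a compact sketch. The following is a hypothetical illustration using SciPy, not the authors' implementation: the function name, the per-group error table `err`, and the storage model are all assumptions. Each group of singular vectors must pick exactly one bit-width, total storage must fit a budget, and total quantization error is minimized.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def allocate_bits(err, group_sizes, bit_choices, budget_bits):
    """err[g, j]: quantization error of singular-vector group g at bit_choices[j]."""
    G, J = err.shape
    c = err.ravel()                                   # objective: total error
    pick_one = np.zeros((G, G * J))                   # each group picks one bit-width
    for g in range(G):
        pick_one[g, g * J:(g + 1) * J] = 1.0
    storage = np.array([[group_sizes[g] * bit_choices[j]
                         for g in range(G) for j in range(J)]])
    res = milp(c=c,
               constraints=[LinearConstraint(pick_one, 1, 1),
                            LinearConstraint(storage, 0, budget_bits)],
               integrality=np.ones(G * J),            # all variables are 0/1
               bounds=Bounds(0, 1))
    assert res.success, "budget too tight for the given bit choices"
    x = res.x.reshape(G, J)
    return [bit_choices[int(j)] for j in x.argmax(axis=1)]
```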
Related papers
- Delta-SVD: Efficient Compression for Personalized Text-to-Image Models [25.0585375727713]
We present Delta-SVD, a post-hoc, training-free compression method that targets the parameter weight updates induced by DreamBooth fine-tuning. We show that Delta-SVD achieves substantial compression with negligible loss in generation quality measured by CLIP score, SSIM and FID.
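A minimal sketch of the truncated-SVD compression idea this entry describes, assuming the delta is a plain 2-D weight matrix; the helper names are illustrative, not Delta-SVD's API:

```python
import numpy as np

def compress_delta(w_base, w_ft, rank):
    """Keep only the top-`rank` singular triplets of the fine-tuning delta."""
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]        # compressed factors

def reconstruct(w_base, u, s, vt):
    """Approximate the fine-tuned weight from the stored low-rank factors."""
    return w_base + (u * s) @ vt
```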
arXiv Detail & Related papers (2025-08-23T01:21:46Z) - Flexible Mixed Precision Quantization for Learned Image Compression [4.847449762378203]
We propose a Flexible Mixed Precision Quantization (FMPQ) method that assigns different bit-widths to different layers of the quantized network. We also introduce an adaptive search algorithm which reduces the time-complexity of searching for the desired distribution of quantization bit-widths.
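The paper's adaptive search algorithm is not reproduced here; as a stand-in, the sketch below illustrates one plausible greedy scheme for assigning per-layer bit-widths under a storage budget, with an assumed per-layer error table:

```python
def search_bit_widths(err, layer_sizes, bit_choices, budget_bits):
    """err[l][j]: proxy error of layer l at bit_choices[j] (bits listed descending)."""
    L = len(layer_sizes)
    idx = [0] * L                                     # start every layer at the widest setting
    total = sum(layer_sizes[l] * bit_choices[0] for l in range(L))
    while total > budget_bits:
        best, best_l = None, None
        for l in range(L):
            if idx[l] + 1 >= len(bit_choices):
                continue
            d_err = err[l][idx[l] + 1] - err[l][idx[l]]
            d_bits = layer_sizes[l] * (bit_choices[idx[l]] - bit_choices[idx[l] + 1])
            score = d_err / d_bits                    # error paid per bit saved
            if best is None or score < best:
                best, best_l = score, l
        if best_l is None:
            break                                     # cannot shrink any further
        total -= layer_sizes[best_l] * (bit_choices[idx[best_l]] - bit_choices[idx[best_l] + 1])
        idx[best_l] += 1
    return [bit_choices[i] for i in idx]
```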
arXiv Detail & Related papers (2025-06-02T00:12:50Z) - Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression [57.71917274869577]
UltraDelta is a data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions.
arXiv Detail & Related papers (2025-05-19T10:37:22Z) - Dynamic Base model Shift for Delta Compression [53.505380509713575]
Delta compression attempts to lower the costs by reducing the redundancy of delta parameters. Existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task. We propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression.
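A toy sketch of the base-model-shift notion, under the assumption that the shift is a single least-squares scalar per weight matrix (the paper's actual adaptation may differ):

```python
import numpy as np

def shifted_delta(w_base, w_task):
    """Rescale the base weight toward the task weight so the residual is smaller."""
    alpha = float((w_base * w_task).sum() / (w_base * w_base).sum())
    return alpha, w_task - alpha * w_base             # compress this smaller residual
```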
arXiv Detail & Related papers (2025-05-16T15:11:19Z) - Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model [0.0]
Mix-QSAM is a mixed-precision Post-Training Quantization (PTQ) framework for the Segment Anything Model (SAM). We introduce a layer-wise importance score, derived using Kullback-Leibler (KL) divergence, to quantify each layer's contribution to the model's output. We also introduce cross-layer synergy, a novel metric based on causal mutual information, to capture dependencies between adjacent layers.
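A speculative sketch of a KL-based layer-importance score in the spirit of this summary, assuming a model that maps a calibration batch to per-example logits and a user-supplied fake-quantization function:

```python
import torch
import torch.nn.functional as F

def layer_importance(model, layer, calib_batch, fake_quant):
    """Higher KL between full-precision and layer-quantized outputs = more important layer."""
    with torch.no_grad():
        p = F.log_softmax(model(calib_batch), dim=-1)  # full-precision reference
        original = layer.weight.data.clone()
        layer.weight.data = fake_quant(original)       # temporarily quantize this layer
        q = F.log_softmax(model(calib_batch), dim=-1)
        layer.weight.data = original                   # restore the weights
    # KL(p || q), averaged over the batch
    return F.kl_div(q, p, log_target=True, reduction="batchmean").item()
```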
arXiv Detail & Related papers (2025-05-08T00:08:31Z) - ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs [9.435738597849447]
ImPart is a novel importance-aware delta sparsification approach. It adjusts sparsity ratios of different singular vectors based on their importance.
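A loose sketch of importance-aware sparsification over singular vectors, using singular values as a stand-in importance signal; the keep-ratio schedule below is an assumption, not ImPart's actual rule:

```python
import numpy as np

def importance_aware_sparsify(delta, rank, min_keep=0.05, max_keep=0.6):
    """Keep more entries of a singular direction's contribution the larger its singular value."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    keep = min_keep + (max_keep - min_keep) * (s[:rank] / s[0])  # per-vector keep ratio
    approx = np.zeros_like(delta)
    for i in range(rank):
        comp = s[i] * np.outer(u[:, i], vt[i])
        thresh = np.quantile(np.abs(comp), 1.0 - keep[i])
        approx += np.where(np.abs(comp) >= thresh, comp, 0.0)
    return approx
```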
arXiv Detail & Related papers (2025-04-17T16:39:36Z) - Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform [51.29604910007176]
We introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
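A rough sketch of JPEG-style, data-free delta compression as described: block-wise 2-D DCT, keep only the largest coefficients, inverse-transform back. Block size and keep ratio are illustrative choices, and matrix dimensions are assumed to be multiples of the block size:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_compress_delta(delta, block=64, keep_ratio=0.05):
    h, w = delta.shape
    out = np.zeros_like(delta)
    for i in range(0, h - h % block, block):
        for j in range(0, w - w % block, block):
            coef = dctn(delta[i:i + block, j:j + block], norm="ortho")
            thresh = np.quantile(np.abs(coef), 1.0 - keep_ratio)
            coef[np.abs(coef) < thresh] = 0.0         # drop small frequency components
            out[i:i + block, j:j + block] = idctn(coef, norm="ortho")
    return out
```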
arXiv Detail & Related papers (2025-03-09T16:03:48Z) - Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
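A minimal sketch of the HIGGS-style recipe as summarized, rotating weights with an orthonormal Hadamard matrix before uniform quantization; the min-max grid below is a crude stand-in for an MSE-optimal grid, and the hidden size is assumed to be a power of two:

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_quantize(w, bits=4):
    n = w.shape[-1]                                   # must be a power of two
    H = hadamard(n) / np.sqrt(n)                      # orthonormal rotation
    r = w @ H                                         # rotated weights look more Gaussian
    levels = 2 ** bits
    step = (r.max() - r.min()) / (levels - 1)
    q = np.round((r - r.min()) / step)                # indices on a uniform grid
    deq = q * step + r.min()
    return deq @ H.T                                  # rotate back to the original basis
```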
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization [17.501956455837707]
Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning.
Current methods that compress the delta weight struggle to achieve ultra-high compression.
We propose DeltaDQ, a novel distribution-driven delta compression framework, to achieve ultra-high compression of the delta weight.
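A speculative sketch matching the entry's two named ingredients, group-wise dropout followed by separate quantization of the survivors; group size, drop rate, and bit-width are illustrative, and the delta's size is assumed divisible by the group size:

```python
import numpy as np

def group_dropout_quantize(delta, group=16, drop=0.75, bits=3, seed=0):
    rng = np.random.default_rng(seed)
    flat = delta.reshape(-1, group)
    keep = rng.random(flat.shape[0]) >= drop          # group-wise dropout mask
    flat = flat * keep[:, None] / (1.0 - drop)        # inverted-dropout rescaling
    scale = np.abs(flat).max() / (2 ** (bits - 1) - 1) or 1.0
    q = np.clip(np.round(flat / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * scale).reshape(delta.shape)           # quantized, sparsified delta
```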
arXiv Detail & Related papers (2024-10-11T09:44:16Z) - Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions of all base models.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
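A bare-bones sketch of decoding-time combination as described, assuming HF-style causal LMs and mixing in probability space (one possible design choice among several):

```python
import torch
import torch.nn.functional as F

def combined_next_token(models, weights, input_ids):
    """Greedy next token from a weighted mix of several models' predictions."""
    probs = None
    with torch.no_grad():
        for model, w in zip(models, weights):
            logits = model(input_ids).logits[:, -1, :]  # last-position logits
            p = w * F.softmax(logits, dim=-1)
            probs = p if probs is None else probs + p
    return torch.argmax(probs, dim=-1)
```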
arXiv Detail & Related papers (2024-06-27T02:46:30Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
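A sketch of the mixed-precision idea attributed to Delta-CoMe in this list, quantizing earlier (larger-singular-value) vectors with more bits; the bucket boundaries and bit-widths below are guesses, not the paper's configuration:

```python
import numpy as np

def quantize(x, bits):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) or 1.0
    return np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

def mixed_precision_delta(delta, buckets=((2, 8), (16, 4), (64, 2))):
    """buckets: (how many singular vectors, bit-width), from largest singular values down."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    approx, start = np.zeros_like(delta), 0
    for count, bits in buckets:
        end = min(start + count, len(s))
        uq = quantize(u[:, start:end], bits)
        vq = quantize(vt[start:end], bits)
        approx += (uq * s[start:end]) @ vq
        start = end
    return approx
```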
arXiv Detail & Related papers (2024-06-13T07:57:27Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group level. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models [90.24999406296867]
In contrast with the standard fine-tuning, delta tuning only fine-tunes a small portion of the model parameters while keeping the rest untouched.
Recent studies have demonstrated that a series of delta tuning methods with distinct tuned parameter selection could achieve performance on a par with full-parameter fine-tuning.
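A toy illustration of the delta-tuning idea, with a LoRA-style low-rank delta chosen here as one concrete instance; only the small update is trained while the base weight stays frozen:

```python
import torch
import torch.nn as nn

class LowRankDelta(nn.Module):
    def __init__(self, base_linear, rank=8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                   # base model stays untouched
        d_out, d_in = base_linear.weight.shape
        self.a = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.b = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        return self.base(x) + x @ self.a.T @ self.b.T  # frozen base + trained delta
```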
arXiv Detail & Related papers (2022-03-14T07:56:32Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)