Related papers: DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization

URL: http://arxiv.org/abs/2410.08666v1
Date: Fri, 11 Oct 2024 09:44:16 GMT
Title: DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization
Authors: Yanfeng Jiang, Zelan Yang, Bohua Chen, Shen Li, Yong Li, Tao Li
Abstract summary: Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning. Current methods that compress the delta weight struggle to achieve ultra-high compression. We propose a novel distribution-driven delta compression framework DeltaDQ to achieve ultra-high compression for the delta weight.
Score: 17.501956455837707
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning. However, the diversity of downstream tasks and practical requirements makes deploying multiple full-parameter fine-tuned models challenging. Current methods that compress the delta weight struggle to achieve ultra-high compression, failing to minimize the deployment overhead. To address the above issue, we propose a novel distribution-driven delta compression framework DeltaDQ, which utilizes Group-wise Dropout and Separate Quantization to achieve ultra-high compression for the delta weight. We have observed that the matrix-computed intermediate results for the delta weight exhibit extremely small variance and min-max range characteristics, referred to as Balanced Intermediate Results. Exploiting this phenomenon, we introduce Group-wise Dropout to perform dropout on the delta weight using an optimal group size. Furthermore, using Separate Quantization, sparse weights are quantized and decomposed to achieve a lower bit. Experimental results show that DeltaDQ achieves 16x compression with improved accuracy compared to baselines for WizardMath and WizardCoder models across different parameter scales. Moreover, DeltaDQ demonstrates the ability for ultra-high compression ratio, achieving 128x compression for the WizardMath-7B model and 512x compression for the WizardMath-70B model.
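
The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the general idea and not the authors' implementation. It computes a delta weight, randomly keeps a fixed number of entries per group, and quantizes the survivors. Several things here are assumptions: the function names, the flat-list weight layout, the rescaling of kept entries by the inverse keep rate, and the use of plain symmetric uniform quantization in place of the paper's Separate Quantization, whose decomposition step is not shown.

```python
import random


def delta_weights(finetuned, base):
    """Delta weight: element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def groupwise_dropout(delta, group_size, keep_per_group, rng):
    """Keep `keep_per_group` random entries in each group of consecutive weights.

    Assumption: survivors are rescaled by group_len / kept so that each
    group's sum is preserved in expectation.
    """
    if group_size < 1 or keep_per_group < 1:
        raise ValueError("group_size and keep_per_group must be >= 1")
    out = [0.0] * len(delta)
    for start in range(0, len(delta), group_size):
        idx = list(range(start, min(start + group_size, len(delta))))
        kept = min(keep_per_group, len(idx))
        scale = len(idx) / kept
        for i in rng.sample(idx, kept):
            out[i] = delta[i] * scale
    return out


def quantize_uniform(values, bits):
    """Symmetric uniform quantization to signed integer codes (assumed scheme)."""
    if bits < 2:
        raise ValueError("bits must be >= 2 for a signed code")
    levels = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0] * len(values), 1.0
    step = max_abs / levels
    # Clamp to guard against floating-point rounding at the extremes.
    codes = [max(-levels, min(levels, round(v / step))) for v in values]
    return codes, step


def dequantize(codes, step):
    """Map integer codes back to approximate real values."""
    return [c * step for c in codes]


# Toy data: a "fine-tuned" model that differs slightly from its base.
rng = random.Random(0)
base = [rng.gauss(0, 1) for _ in range(64)]
finetuned = [b + rng.gauss(0, 0.01) for b in base]

delta = delta_weights(finetuned, base)
sparse = groupwise_dropout(delta, group_size=8, keep_per_group=2, rng=rng)
codes, step = quantize_uniform(sparse, bits=4)

# Reconstruct approximate fine-tuned weights from the base plus the compressed delta.
restored = [b + d for b, d in zip(base, dequantize(codes, step))]
```

In this toy configuration, keeping 2 of every 8 entries gives 4x from sparsity, and storing 4-bit codes instead of 16-bit values gives another 4x, for a nominal 16x before counting the overhead of storing positions and the scale. These settings are illustrative and are not taken from the paper.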

Related papers

ADAMIX: Adaptive Mixed-Precision Delta-Compression with Quantization Error Optimization for Large Language Models [14.975251449732175]
Large language models (LLMs) achieve impressive performance on various knowledge-intensive and complex reasoning tasks. Recent works explore delta-compression approaches to quantize and compress the delta parameters between the customized LLM and the corresponding base model. We propose ADAMIX, an effective adaptive mixed-precision delta-compression framework.
arXiv Detail & Related papers (2025-06-05T08:17:12Z)
Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression [53.08742231761896]
UltraDelta is a data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions.
arXiv Detail & Related papers (2025-05-19T10:37:22Z)
Dynamic Base model Shift for Delta Compression [53.505380509713575]
Delta compression attempts to lower the costs by reducing the redundancy of delta parameters. Existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task. We propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression.
arXiv Detail & Related papers (2025-05-16T15:11:19Z)
ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs [9.435738597849447]
ImPart is a novel importance-aware delta sparsification approach. It adjusts sparsity ratios of different singular vectors based on their importance.
arXiv Detail & Related papers (2025-04-17T16:39:36Z)
Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform [51.29604910007176]
We introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
arXiv Detail & Related papers (2025-03-09T16:03:48Z)
Delta Decompression for MoE-based LLMs Compression [22.144081182788394]
D²-MoE is a new delta decompression compressor for reducing the parameters of MoE LLMs. We decompose their weights into a shared base weight and unique delta weights. Experiments highlight the superiority of our approach, with over 13% performance gains.
arXiv Detail & Related papers (2025-02-24T16:32:22Z)
Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP). ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run. We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [79.56709262189953]
EoRA consistently outperforms previous methods in compensating errors for compressed LLaMA2/3 models on various tasks. EoRA offers a scalable, training-free solution to compensate for compression errors.
arXiv Detail & Related papers (2024-10-28T17:59:03Z)
BitDelta: Your Fine-Tune May Only Be Worth One Bit [57.558376557639555]
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x.
arXiv Detail & Related papers (2024-02-15T18:50:06Z)
Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence. We find that gradients require milder compression rates than activations. Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs [7.1597349516197655]
Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. Serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns. We present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently.
arXiv Detail & Related papers (2023-12-08T18:07:05Z)
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference and smaller memory footprints, and enables local deployment. Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits. Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy. More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)
Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization [5.648270790530862]
State-of-the-art approaches involve lossy model compression mechanisms, which induce a tradeoff between the resulting model quality (accuracy) and compression ratio. We make a key enabling observation that the sensitivity of model weights to compression varies during training, and different weights benefit from different quantization levels. We propose a non-uniform quantization scheme that leverages this variation, an efficient search mechanism that dynamically finds the best quantization configurations, and a quantization-aware delta compression mechanism that rearranges weights to minimize checkpoint differences.
arXiv Detail & Related papers (2023-06-20T18:00:31Z)
Compressing Transformer-based self-supervised models for speech processing [45.254624876127124]
We study several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation. We report trade-offs at various compression rates, including wall-clock time, the number of parameters, and the number of multiply-accumulate operations. Our results lead to a simple combination of compression techniques that improves the trade-off over recent approaches.
arXiv Detail & Related papers (2022-11-17T23:53:52Z)
Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings. We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques including knowledge distillation and pruning. We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets. We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
Self-Supervised GAN Compression [32.21713098893454]
We show that a standard model compression technique, weight pruning, cannot be applied to GANs using existing methods. We then develop a self-supervised compression technique which uses the trained discriminator to supervise the training of a compressed generator. We show that this framework delivers compelling performance up to high degrees of sparsity, can be easily applied to new tasks and models, and enables meaningful comparisons between different pruning granularities.
arXiv Detail & Related papers (2020-07-03T04:18:54Z)
Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.