Related papers: Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models

Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models

URL: http://arxiv.org/abs/2409.11233v1
Date: Tue, 17 Sep 2024 14:34:11 GMT
Title: Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models
Authors: Bishwash Khanal, Jeffery M. Capone,
Abstract summary: Large language models (LLMs) offer powerful capabilities but incur substantial computational costs. This study evaluates the impact of popular compression methods on the LLaMA-2-7B model. We show that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) offer powerful capabilities but incur substantial computational costs, driving the need for efficient compression techniques. This study evaluates the impact of popular compression methods - Magnitude Pruning, SparseGPT, and Wanda - on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data. Our findings reveal that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks, highlighting the inadequacy of perplexity as the sole evaluation metric. To address this, we introduce Jensen-Shannon (JS) Divergence as a more comprehensive metric that captures nuanced changes in model behavior post-compression. We further demonstrate that task-specific calibration data significantly enhances the downstream performance of compressed models compared to general calibration data. This research underscores the necessity for diverse evaluation metrics and careful calibration data selection to fully understand the complexities of LLM compression and its implications for practical applications.

Related papers

Unified Scaling Laws for Compressed Representations [69.72517034565467]
We investigate whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations.<n>Our main finding is demonstrating both theoretically and empirically that there exists a simple "capacity" metric.<n>We extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
arXiv Detail & Related papers (2025-06-02T16:52:51Z)
Optimizing Singular Spectrum for Large Language Model Compression [95.7621116637755]
We introduce SoCo, a novel compression framework that learns to rescale the decomposed components of SVD in a data-driven manner. Thanks to the learnable singular spectrum, SoCo adaptively prunes components according to the sparsified importance scores. Experimental evaluations across multiple LLMs and benchmarks demonstrate that SoCo surpasses the state-of-the-art methods in model compression.
arXiv Detail & Related papers (2025-02-20T23:18:39Z)
Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP)<n>ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run.<n>We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
Beware of Calibration Data for Pruning Large Language Models [41.1689082093302]
Post-training pruning is a promising method that does not require resource-intensive iterative training. We show that the effects of calibration data even value more than designing advanced pruning strategies. Our preliminary exploration also discloses that using calibration data similar to the training data can yield better performance.
arXiv Detail & Related papers (2024-10-23T09:36:21Z)
Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models. We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
Data-freeWeight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Large language models (LLMs) provide faster inference, smaller memory footprints, and enables local deployment. Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits. Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy. More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)
On the Impact of Calibration Data in Post-training Quantization and Pruning [36.1039389951318]
Quantization and pruning form the foundation of compression for neural networks. We present the first empirical study on the effect of calibration data upon model compression methods.
arXiv Detail & Related papers (2023-11-16T10:30:00Z)
Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision. Model compression detrimentally impacts the performance of visual prompting-based transfer. However, negative effects on calibration are not present when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z)
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
Language model compression with weighted low-rank factorization [73.61874728240568]
We introduce Fisher information to weigh the importance of parameters affecting the model prediction. We find that our resulting task accuracy is much closer to the original model's performance. Our method can directly compress a task-specific model while achieving better performance than other compact model strategies.
arXiv Detail & Related papers (2022-06-30T21:57:07Z)
What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques including knowledge distillation and pruning. We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets. We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.