Beware of Calibration Data for Pruning Large Language Models
- URL: http://arxiv.org/abs/2410.17711v1
- Date: Wed, 23 Oct 2024 09:36:21 GMT
- Title: Beware of Calibration Data for Pruning Large Language Models
- Authors: Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
- Abstract summary: Post-training pruning is a promising method that does not require resource-intensive iterative training.
We show that the choice of calibration data can matter even more than designing advanced pruning strategies.
Our preliminary exploration also discloses that using calibration data similar to the training data can yield better performance.
- Score: 41.1689082093302
- License:
- Abstract: As large language models (LLMs) are widely applied across various fields, model compression has become increasingly crucial for reducing costs and improving inference efficiency. Post-training pruning is a promising method that does not require resource-intensive iterative training and only needs a small amount of calibration data to assess the importance of parameters. Previous research has primarily focused on designing advanced pruning methods, while the impact of different calibration data on pruning performance still lacks systematic exploration. We fill this gap and surprisingly observe that the choice of calibration data matters even more than designing advanced pruning strategies, especially at high sparsity. Our preliminary exploration also reveals that using calibration data similar to the training data can yield better performance. As pre-training data is usually inaccessible for advanced LLMs, we further provide a self-generating calibration data synthesis strategy to construct feasible calibration data. We conduct experiments on recent strong open-source LLMs (e.g., DCLM and LLaMA-3), and the results show that the proposed method outperforms commonly used calibration data and can effectively enhance strong pruning methods (e.g., Wanda, OWL).
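To make the abstract's mechanics concrete, below is a minimal sketch of (i) self-generating calibration text by sampling continuations from the model itself and (ii) a Wanda-style importance score, |W_ij| * ||X_j||_2, computed from calibration activations. The checkpoint name, seed prompts, target layer, and 50% sparsity level are illustrative assumptions, not the paper's released setup.

```python
# Hedged sketch: Wanda-style magnitude-times-activation pruning driven by
# self-generated calibration data. Checkpoint, prompts, layer path, and the
# 50% sparsity target are illustrative assumptions, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"     # any LLaMA-style LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# 1) Self-generate calibration data: sample continuations from the model
#    itself instead of relying on (often inaccessible) pre-training text.
seed_prompts = ["The history of", "In recent research,", "A simple recipe for"]
calib_texts = []
for p in seed_prompts:
    ids = tok(p, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=128, do_sample=True, top_p=0.95)
    calib_texts.append(tok.decode(out[0], skip_special_tokens=True))

# 2) Collect input activations of one linear layer with a forward hook
#    (module path assumes a LLaMA-style architecture).
layer = model.model.layers[0].mlp.down_proj
sq_norms = torch.zeros(layer.in_features)             # running sum of x_j^2

def hook(_module, inputs, _output):
    x = inputs[0].detach().float().reshape(-1, layer.in_features)
    sq_norms.add_((x ** 2).sum(dim=0).cpu())

handle = layer.register_forward_hook(hook)
with torch.no_grad():
    for text in calib_texts:
        batch = tok(text, return_tensors="pt", truncation=True, max_length=256)
        model(**batch)
handle.remove()

# 3) Wanda-style score |W_ij| * ||X_j||_2, pruned row-wise to 50% sparsity.
W = layer.weight.data.float().cpu()
scores = W.abs() * sq_norms.sqrt().unsqueeze(0)
k = W.shape[1] // 2                                    # weights to drop per row
drop = scores.argsort(dim=1)[:, :k]                    # lowest-scoring indices
mask = torch.ones_like(W)
mask.scatter_(1, drop, 0.0)
layer.weight.data.mul_(mask.to(layer.weight.dtype).to(layer.weight.device))
```

In practice the same hook-and-score loop would run over every linear layer; the paper's point is that the text flowing through those hooks, i.e. the calibration data, can matter more at high sparsity than the scoring rule itself.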
Related papers
- Using Large Language Models for Expert Prior Elicitation in Predictive Modelling [53.54623137152208]
This study proposes using large language models (LLMs) to elicit expert prior distributions for predictive models.
We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation.
Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings.
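As an illustration of the elicitation idea, here is a minimal sketch that plugs an LLM-supplied Normal prior into a conjugate Bayesian update; the reply string, JSON schema, and single-coefficient model are assumptions, not the paper's protocol.

```python
# Hedged sketch: turning an LLM-elicited prior into a Bayesian update.
# The reply string, JSON schema, and conjugate-Normal model are illustrative
# assumptions, not the paper's elicitation protocol.
import json
import numpy as np

# Imagine the LLM was asked: "Give a Normal prior (mean, std) for the effect
# of daily exercise minutes on resting heart rate, as JSON."
llm_reply = '{"mean": -0.10, "std": 0.05}'            # placeholder reply
prior = json.loads(llm_reply)
mu0, tau0 = prior["mean"], prior["std"]

# Tiny low-data regression: y = beta * x + noise, with known noise sigma.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = -0.12 * x + rng.normal(scale=0.2, size=8)
sigma = 0.2

# Conjugate posterior for beta under the elicited prior N(mu0, tau0^2).
prec_post = 1.0 / tau0**2 + (x @ x) / sigma**2
mu_post = (mu0 / tau0**2 + (x @ y) / sigma**2) / prec_post
print(f"posterior mean={mu_post:.3f}, sd={prec_post**-0.5:.3f}")
```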
arXiv Detail & Related papers (2024-11-26T10:13:39Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Fill In The Gaps: Model Calibration and Generalization with Synthetic Data [2.89287673224661]
We propose a calibration method that incorporates synthetic data without compromising accuracy.
We derive the expected calibration error (ECE) bound using the Probably Approximately Correct (PAC) learning framework.
We observed average gains of up to 34% in accuracy and a 33% reduction in ECE.
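For reference, a minimal sketch of the expected calibration error the entry refers to (standard equal-width binning; the synthetic predictions and 10-bin setup are assumptions):

```python
# Hedged sketch: expected calibration error (ECE) with equal-width bins.
# The synthetic confidences and 10-bin setup are illustrative assumptions.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_b (|B_b|/N) * |acc(B_b) - conf(B_b)|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap                 # weight by bin mass
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)               # model confidences
correct = rng.uniform(size=1000) < conf * 0.9         # slightly overconfident
print(f"ECE ~ {expected_calibration_error(conf, correct):.3f}")
```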
arXiv Detail & Related papers (2024-10-07T23:06:42Z)
- Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models [0.0]
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs.
This study evaluates the impact of popular compression methods on the LLaMA-2-7B model.
We show that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks.
arXiv Detail & Related papers (2024-09-17T14:34:11Z)
- AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent.
We introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales.
Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface-form cues to identify data that exhibits the reasoning skills needed for the intended downstream application.
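A heavily simplified sketch of gradient-similarity selection in the spirit of LESS; a logistic model with random-projection gradient features stands in for the paper's LoRA-gradient machinery, and all data and dimensions are invented:

```python
# Hedged sketch of gradient-similarity data selection in the spirit of LESS.
# A logistic model and random projection replace the paper's LoRA-gradient
# features; all data, sizes, and the 5% budget are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_val, proj_dim = 50, 200, 20, 16
w = rng.normal(size=d) * 0.1                           # current model weights
Xtr, ytr = rng.normal(size=(n_train, d)), rng.integers(0, 2, n_train)
Xva, yva = rng.normal(size=(n_val, d)), rng.integers(0, 2, n_val)

def per_example_grads(X, y, w):
    """Gradient of logistic loss for each example: (sigmoid(xw) - y) * x."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X                        # shape (n, d)

P = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)  # random projection
g_train = per_example_grads(Xtr, ytr, w) @ P
g_target = per_example_grads(Xva, yva, w).mean(axis=0) @ P

# Cosine similarity between each training gradient and the target gradient.
sims = (g_train @ g_target) / (
    np.linalg.norm(g_train, axis=1) * np.linalg.norm(g_target) + 1e-12
)
selected = np.argsort(-sims)[: max(1, n_train // 20)]  # keep the top 5%
print("selected indices:", selected)
```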
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- On the Impact of Calibration Data in Post-training Quantization and Pruning [36.1039389951318]
Quantization and pruning form the foundation of compression for neural networks.
We present the first empirical study on the effect of calibration data upon model compression methods.
arXiv Detail & Related papers (2023-11-16T10:30:00Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- From calibration to parameter learning: Harnessing the scaling effects of big data in geoscientific modeling [2.9897531698031403]
We propose a differentiable parameter learning (dPL) framework that efficiently learns a global mapping between inputs and parameters.
As training data increases, dPL achieves better performance, more physical coherence, and better generalizability.
We demonstrate examples learned from soil moisture and streamflow data, where dPL drastically outperformed existing evolutionary and regionalization methods.
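A toy sketch of the differentiable parameter learning (dPL) idea: a small network maps static attributes to a process-model parameter, and gradients flow through a simple differentiable reservoir model; the simulator, data, and sizes below are invented stand-ins, not the authors' geoscientific setup.

```python
# Hedged sketch of differentiable parameter learning (dPL): a small network
# maps static attributes to a physical parameter, and gradients flow through
# a toy differentiable runoff model. Simulator, data, and sizes are invented.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_sites, n_steps = 32, 50
attrs = torch.rand(n_sites, 4)                         # static site attributes
rain = torch.rand(n_sites, n_steps)                    # forcing (precipitation)

def bucket_model(rain, k):
    """Toy linear-reservoir runoff: storage drains at learned rate k."""
    storage = torch.zeros(rain.shape[0])
    flows = []
    for t in range(rain.shape[1]):
        storage = storage + rain[:, t]
        q = k * storage                                # runoff proportional to storage
        storage = storage - q
        flows.append(q)
    return torch.stack(flows, dim=1)

true_k = 0.3 + 0.4 * attrs[:, 0]                       # hidden "true" parameter
obs = bucket_model(rain, true_k).detach()              # synthetic observations

# Global attribute-to-parameter mapping, trained end-to-end through the model.
g = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1), nn.Sigmoid())
opt = torch.optim.Adam(g.parameters(), lr=1e-2)
for step in range(500):
    k_hat = g(attrs).squeeze(-1)
    loss = ((bucket_model(rain, k_hat) - obs) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.4f}")
```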
arXiv Detail & Related papers (2020-07-30T21:38:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.