Beware of Calibration Data for Pruning Large Language Models
- URL: http://arxiv.org/abs/2410.17711v2
- Date: Sun, 29 Jun 2025 07:21:34 GMT
- Title: Beware of Calibration Data for Pruning Large Language Models
- Authors: Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
- Abstract summary: Post-training pruning is a promising method that does not require resource-intensive iterative training. Calibration data is also crucial to post-training pruning, especially for high sparsity. We provide a self-generating calibration data synthesis strategy to construct feasible calibration data.
- Score: 41.1689082093302
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As large language models (LLMs) are widely applied across various fields, model compression has become increasingly crucial for reducing costs and improving inference efficiency. Post-training pruning is a promising method that does not require resource-intensive iterative training and only needs a small amount of calibration data to assess the importance of parameters. Recent research has enhanced post-training pruning from different aspects but few of them systematically explore the effects of calibration data, and it is unclear if there exist better calibration data construction strategies. We fill this blank and surprisingly observe that calibration data is also crucial to post-training pruning, especially for high sparsity. Through controlled experiments on important influence factors of calibration data, including the pruning settings, the amount of data, and its similarity with pre-training data, we observe that a small size of data is adequate, and more similar data to its pre-training stage can yield better performance. As pre-training data is usually inaccessible for advanced LLMs, we further provide a self-generating calibration data synthesis strategy to construct feasible calibration data. Experimental results on recent strong open-source LLMs (e.g., DCLM, and LLaMA-3) show that the proposed strategy can enhance the performance of strong pruning methods (e.g., Wanda, DSnoT, OWL) by a large margin (up to $2.68\%$). Code is available at https://github.com/Dereck0602/calibration_data.
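The pruning methods the abstract names (e.g., Wanda) score each weight by its magnitude scaled by the norm of its input activations, which is exactly where the calibration data enters. A minimal sketch of such a calibration-dependent score, assuming a dense layer with a `(out_features, in_features)` weight matrix; the function name and shapes are illustrative, not the authors' implementation:

```python
import numpy as np

def wanda_prune(weight: np.ndarray, calib_acts: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the lowest-scoring weights per output row, Wanda-style.

    weight:     (out_features, in_features) layer weight matrix
    calib_acts: (n_samples, in_features) input activations collected by
                running calibration data through the model
    sparsity:   fraction of weights to remove in each row, e.g. 0.5
    """
    # Per-input-channel activation norm estimated from the calibration data.
    act_norm = np.linalg.norm(calib_acts, axis=0)      # (in_features,)
    # Importance score: |weight| scaled by the input activation norm.
    score = np.abs(weight) * act_norm                  # broadcasts over rows
    # Remove the k lowest-scoring weights in each output row.
    k = int(weight.shape[1] * sparsity)
    pruned = weight.copy()
    if k > 0:
        lowest = np.argsort(score, axis=1)[:, :k]      # indices of smallest scores
        np.put_along_axis(pruned, lowest, 0.0, axis=1)
    return pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(32, 16))   # stand-in for calibration activations
W_pruned = wanda_prune(W, X, 0.5)
print((W_pruned == 0).mean())   # half the weights are zeroed
```

Because `act_norm` is computed from the calibration set, different calibration data yields different pruning masks, which is the sensitivity the paper studies.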
Related papers
- Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information [2.133855532092057]
We propose an effective data reduction strategy based on Pointwise V-Information (PVI). Experiments show that classifier performance is maintained, with only a 0.0001% to 0.76% reduction in accuracy, when 10%-30% of the data is removed. We have adapted the PVI framework, previously limited to English datasets, to a variety of Chinese NLP tasks and base models.
arXiv Detail & Related papers (2025-06-19T06:59:19Z) - What Really Matters for Learning-based LiDAR-Camera Calibration [50.2608502974106]
This paper revisits the development of learning-based LiDAR-Camera calibration.
We identify the critical limitations of regression-based methods with the widely used data generation pipeline.
We also investigate how the input data format and preprocessing operations impact network performance.
arXiv Detail & Related papers (2025-01-28T14:12:32Z) - Curriculum-style Data Augmentation for LLM-based Metaphor Detection [7.4594050203808395]
We propose a method for metaphor detection by fine-tuning open-source LLMs.
Our method achieves state-of-the-art performance across all baselines.
arXiv Detail & Related papers (2024-12-04T02:05:21Z) - Using Large Language Models for Expert Prior Elicitation in Predictive Modelling [53.54623137152208]
This study proposes using large language models (LLMs) to elicit expert prior distributions for predictive models.
We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation.
Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings.
arXiv Detail & Related papers (2024-11-26T10:13:39Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning [56.795078085234195]
LLM pruning approaches universally rely on the C4 dataset as the calibration data for calculating pruning scores.
In this study, we evaluate the choice of calibration data on LLM pruning, across a wide range of datasets.
Our results also uncover several subtle and often unexpected findings.
arXiv Detail & Related papers (2024-10-09T22:00:19Z) - Fill In The Gaps: Model Calibration and Generalization with Synthetic Data [2.89287673224661]
We propose a calibration method that incorporates synthetic data without compromising accuracy.
We derive the expected calibration error (ECE) bound using the Probably Approximately Correct (PAC) learning framework.
We observed up to a 34% increase in accuracy and a 33% decrease in ECE.
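The expected calibration error (ECE) this entry bounds is the bin-weighted gap between a model's accuracy and its stated confidence. A minimal sketch of the standard binned estimator; the toy inputs are illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by fraction of samples in the bin
    return ece

# Toy case: slightly overconfident in one bin, underconfident in another.
conf = np.array([0.95, 0.95, 0.55, 0.55])
corr = np.array([1, 1, 1, 0])
print(expected_calibration_error(conf, corr))
```

A perfectly calibrated model has a zero gap in every bin, so adding synthetic data helps exactly when it moves bin accuracy toward bin confidence.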
arXiv Detail & Related papers (2024-10-07T23:06:42Z) - Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models [0.0]
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs.
This study evaluates the impact of popular compression methods on the LLaMA-2-7B model.
We show that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks.
arXiv Detail & Related papers (2024-09-17T14:34:11Z) - Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We present a framework that selects high-quality pretraining data without any LLM training of our own. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations. Our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM.
arXiv Detail & Related papers (2024-09-09T17:23:29Z) - AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent.
We introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales.
Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance.
arXiv Detail & Related papers (2024-07-29T17:06:30Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exhibits the necessary reasoning skills for the intended downstream application.
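The low-rank gradient similarity search that LESS describes can be approximated as: project per-example gradients to a low dimension, then rank training examples by cosine similarity to a target-task gradient. A hypothetical simplification, assuming precomputed gradient vectors; the function name, random projection, and shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def select_by_gradient_similarity(train_grads, target_grad, frac=0.05,
                                  proj_dim=64, seed=0):
    """Keep the top `frac` of examples by projected gradient cosine similarity.

    train_grads: (n_examples, d) per-example gradient vectors
    target_grad: (d,) gradient of the target-task loss
    """
    rng = np.random.default_rng(seed)
    d = train_grads.shape[1]
    # Random projection to proj_dim keeps inner products approximately intact.
    P = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)
    tg = train_grads @ P                      # (n_examples, proj_dim)
    vg = target_grad @ P                      # (proj_dim,)
    sims = (tg @ vg) / (np.linalg.norm(tg, axis=1) * np.linalg.norm(vg) + 1e-12)
    k = max(1, int(len(sims) * frac))
    return np.argsort(sims)[::-1][:k]         # indices of most similar examples

rng = np.random.default_rng(1)
G = rng.normal(size=(20, 100))
g = rng.normal(size=100)
G[3] = 2.0 * g                                # example aligned with the target
print(select_by_gradient_similarity(G, g, frac=0.05))  # selects index 3
```

Selecting 5% this way mirrors the paper's headline setting, where a small influence-ranked subset can match or beat full-data training.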
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - On the Impact of Calibration Data in Post-training Quantization and Pruning [36.1039389951318]
Quantization and pruning form the foundation of compression for neural networks.
We present the first empirical study on the effect of calibration data upon model compression methods.
arXiv Detail & Related papers (2023-11-16T10:30:00Z) - When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - From calibration to parameter learning: Harnessing the scaling effects of big data in geoscientific modeling [2.9897531698031403]
We propose a differentiable parameter learning framework that efficiently learns a global mapping between inputs and parameters.
As training data increases, dPL achieves better performance, more physical coherence, and better generalizability.
We demonstrate examples that learned from soil moisture and streamflow, where dPL drastically outperformed existing evolutionary and regionalization methods.
arXiv Detail & Related papers (2020-07-30T21:38:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.