Related papers: Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization

Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization

URL: http://arxiv.org/abs/2510.10618v1
Date: Sun, 12 Oct 2025 14:00:23 GMT
Title: Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization
Authors: Bowei He, Lihao Yin, Huiling Zhen, Shuqi Liu, Han Wu, Xiaokun Zhang, Mingxuan Yuan, Chen Ma,
Abstract summary: Post-training compression is used to scale down large language model (LLM)<n> calibration data plays a vital role by informing the weight importance and activation dynamic ranges.<n>We explore the calibration data's impacts on high-level complex reasoning capabilities, like math problem solving and code generation.
Score: 29.190967827961142
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Post-training compression has been a widely employed approach to scale down large language model (LLM) and facilitate efficient inference. In various proposed compression methods, including pruning and quantization, calibration data plays a vital role by informing the weight importance and activation dynamic ranges. However, how calibration data impacts the LLM capability after compression is less explored. Few of the existing works, though recognizing the significance of this study, only investigate the language modeling or commonsense reasoning performance degradation from limited angles, like the data sources or sample amounts. More systematic research is still needed to examine the impacts on different LLM capabilities in terms of compositional properties and domain correspondence of calibration data. In this work, we aim at bridging this gap and further analyze underlying influencing mechanisms from the activation pattern perspective. Especially, we explore the calibration data's impacts on high-level complex reasoning capabilities, like math problem solving and code generation. Delving into the underlying mechanism, we find that the representativeness and diversity in activation space more fundamentally determine the quality of calibration data. Finally, we propose a calibration data curation framework based on such observations and analysis, enhancing the performance of existing post-training compression methods on preserving critical LLM capabilities. Our code is provided in \href{https://github.com/BokwaiHo/COLA.git}{Link}.

Related papers

CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models.<n>We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment.<n>We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [65.23593936798662]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation.<n>This is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z)
Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles [4.477423478591491]
Calib-n is a novel framework that trains an auxiliary model for confidence estimation.<n>We find that few-shot prompts are the most effective for auxiliary model-based methods.
arXiv Detail & Related papers (2025-01-07T18:48:42Z)
Beware of Calibration Data for Pruning Large Language Models [41.1689082093302]
Post-training pruning is a promising method that does not require resource-intensive iterative training.<n> calibration data is also crucial to post-training pruning, especially for high sparsity.<n>We provide a self-generating calibration data synthesis strategy to construct feasible calibration data.
arXiv Detail & Related papers (2024-10-23T09:36:21Z)
Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models [0.0]
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs. This study evaluates the impact of popular compression methods on the LLaMA-2-7B model. We show that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks.
arXiv Detail & Related papers (2024-09-17T14:34:11Z)
Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs [27.38239289662178]
Post-Training Quantization (PTQ) enhances the efficiency of Large Language Models (LLMs) We explore the role of calibration sets in PTQ, specifically their effect on hidden activations. Our analysis reveals a marked contrast in quantization effectiveness across accessible models.
arXiv Detail & Related papers (2024-05-31T14:24:33Z)
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models. We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
Data-free Weight Compress and Denoise for Large Language Models [96.68582094536032]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.<n>We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
Large Language Models are Miscalibrated In-Context Learners [22.30783674111999]
In this work, we deliver an in-depth analysis of the behavior across different choices of learning methods.<n>We observe that the miscalibration problem exists across all learning methods in low-resource setups.<n>We find that self-ensembling with max probability produces robust and calibrated predictions.
arXiv Detail & Related papers (2023-12-21T11:55:10Z)
On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes. Our analysis reveals that the impact of diversified human preferences depends on both model size and data size. Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
On the Impact of Calibration Data in Post-training Quantization and Pruning [36.1039389951318]
Quantization and pruning form the foundation of compression for neural networks. We present the first empirical study on the effect of calibration data upon model compression methods.
arXiv Detail & Related papers (2023-11-16T10:30:00Z)
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.