Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models
- URL: http://arxiv.org/abs/2508.18609v2
- Date: Wed, 27 Aug 2025 04:50:50 GMT
- Title: Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models
- Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu
- Abstract summary: Post-training quantization (PTQ) is a practical compression solution for large language models (LLMs). Existing scaling laws for quantized models often overlook PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an empirical investigation to establish task-stratified scaling laws.
- Score: 31.666361168937396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ's impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.
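The abstract does not state the fitted functional form, so the following is only a minimal sketch, assuming a multiplicative power law in model size, effective bit-width, and calibration set size (group size omitted for brevity); the variable names, exponents, and synthetic data are illustrative assumptions, not the paper's reported law.

```python
# Hypothetical task-stratified scaling-law fit (illustration only; the
# functional form actually fitted in the paper is not given in the abstract).
import numpy as np
from scipy.optimize import curve_fit

def knowledge_loss(X, a, beta, gamma, delta, L_inf):
    """Assumed power law: loss falls with model size N (billions of params),
    effective bit-width b, and calibration set size C."""
    N, b, C = X
    return a * N ** (-beta) * b ** (-gamma) * C ** (-delta) + L_inf

# Synthetic observations generated from the assumed law plus noise, so the
# fit below converges; real use would plug in measured benchmark losses.
N = np.array([1.0, 1.0, 7.0, 7.0, 13.0, 13.0, 70.0, 70.0])
b = np.array([4.0, 3.0, 4.0, 3.0, 4.0, 2.0, 4.0, 2.0])
C = np.array([128.0, 512.0, 128.0, 512.0, 128.0, 512.0, 128.0, 512.0])
rng = np.random.default_rng(0)
loss = knowledge_loss((N, b, C), 3.0, 0.15, 0.8, 0.05, 1.0) + rng.normal(0.0, 0.01, N.size)

popt, _ = curve_fit(knowledge_loss, (N, b, C), loss,
                    p0=[3.0, 0.1, 0.5, 0.05, 1.0], maxfev=20000)
a, beta, gamma, delta, L_inf = popt
print(f"size exp {beta:.2f}, bit-width exp {gamma:.2f}, calibration exp {delta:.2f}")

# Under the paper's central finding, a memorization benchmark would yield
# larger bit-width / calibration-size / size exponents than a knowledge-
# utilization benchmark fitted with the same form.
```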
Related papers
- Towards Understanding Best Practices for Quantization of Vision-Language Models [42.75375241956508]
Large language models (LLMs) deliver impressive results for a variety of tasks. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. We investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines.
arXiv Detail & Related papers (2026-01-21T18:59:51Z)
- Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
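As a rough illustration of what an activation outlier means here, the snippet below flags channels whose peak magnitude is far above the tensor's typical per-channel peak; the ratio threshold and the per-channel criterion are assumptions for illustration, not the detection rule used in that study.

```python
# Illustrative outlier-channel check on an activation tensor (not the
# paper's criterion): flag channels whose peak magnitude is far larger
# than the median per-channel peak.
import numpy as np

def outlier_channels(acts: np.ndarray, ratio: float = 20.0) -> np.ndarray:
    """acts: (tokens, hidden_dim) activations from one layer."""
    per_channel_peak = np.abs(acts).max(axis=0)      # (hidden_dim,)
    typical = np.median(per_channel_peak)
    return np.where(per_channel_peak > ratio * typical)[0]

# Toy example: a few channels carry abnormally large values, which is what
# makes a single per-tensor quantization grid waste precision elsewhere.
rng = np.random.default_rng(0)
acts = rng.normal(0, 1, size=(512, 1024))
acts[:, [3, 700]] *= 80.0
print(outlier_channels(acts))   # -> [  3 700]
```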
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
- Is Quantization a Deal-breaker? Empirical Insights from Large Code Models [7.182449176083625]
We apply Activation-aware Weight Quantization (AWQ) to two widely used code models, CodeLlama and DeepSeekCoder, to generate Java and Python code. Our findings reveal that quantization is a robust technique that not only preserves functional correctness, but also retains key qualitative code attributes sought after by developers.
arXiv Detail & Related papers (2025-07-13T14:58:19Z)
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries. These massive values play a critical role in interpreting contextual knowledge. We trace the emergence of massive values and find that such concentration is caused by Rotary Positional Embedding (RoPE).
arXiv Detail & Related papers (2025-02-03T17:47:03Z)
- Scaling Laws for Post Training Quantized Large Language Models [41.78467383320145]
Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. The quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice.
arXiv Detail & Related papers (2024-10-15T23:34:22Z)
- Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview [4.166341398835636]
We discuss the necessity and impact of model size growth, highlighting the performance benefits as well as the computational challenges and environmental considerations.
We delve into various quantization techniques, including both post-training quantization (PTQ) and quantization-aware training (QAT).
We examine how these methods address issues like outliers, importance weighting, and activation quantization, ultimately contributing to more sustainable and accessible deployment of large-scale models.
arXiv Detail & Related papers (2024-09-18T02:35:00Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
- Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective [106.92016199403042]
We empirically investigate knowledge transfer from larger to smaller models through a parametric perspective.
We employ sensitivity-based techniques to extract and align knowledge-specific parameters between different large language models.
Our findings highlight the critical factors contributing to the process of parametric knowledge transfer.
arXiv Detail & Related papers (2023-10-17T17:58:34Z)
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning. (A generic sketch of the group-wise low-bit quantization these bit-width settings refer to follows this entry.)
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
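To make the bit-width and group-size knobs discussed above concrete, here is a minimal generic sketch of group-wise symmetric round-to-nearest weight quantization; it is a simplified baseline, not GPTQ, AWQ, or the specific method evaluated in any of the papers listed.

```python
# Generic group-wise symmetric round-to-nearest weight quantization.
# Illustrative baseline only: GPTQ/AWQ add calibration-driven scaling or
# error-compensating rounding on top of this basic scheme.
import numpy as np

def quantize_groupwise(w: np.ndarray, bits: int = 4, group_size: int = 128):
    """Quantize a (rows, cols) weight matrix in groups along each row."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-8)               # avoid division by zero
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return (q * scales).reshape(rows, cols), scales

# Smaller groups and more bits track the original weights more closely,
# which is the effective bit-width / group-size trade-off that the
# task-stratified scaling laws above quantify per knowledge capability.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(64, 512))
for bits in (4, 3, 2):
    w_hat, _ = quantize_groupwise(w, bits=bits, group_size=128)
    print(bits, "bits  mean abs error:", np.abs(w - w_hat).mean())
```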