Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs
- URL: http://arxiv.org/abs/2506.09104v1
- Date: Tue, 10 Jun 2025 16:26:32 GMT
- Title: Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs
- Authors: Jung Hyun Lee, Seungjae Shin, Vinnam Kim, Jaeseong You, An Chen,
- Abstract summary: We propose Unified Progressive Quantization (UPQ) that unifies block-wise post-training quantization with distillation-based quantization-aware training (Distill-QAT) for instruction-tuned LLM quantization. To the best of our knowledge, we are the first to demonstrate that UPQ can quantize open-source instruction-tuned LLMs to INT2 without relying on proprietary post-training data.
- Score: 12.050611514060023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the rapid scaling of large language models (LLMs) poses significant challenges for deployment on resource-constrained devices, there is growing interest in extremely low-bit quantization, such as 2-bit. Although prior works have shown that 2-bit large models are Pareto-optimal over their 4-bit smaller counterparts in both accuracy and latency, these advancements have been limited to pre-trained LLMs and have not yet been extended to instruction-tuned models. To bridge this gap, we propose Unified Progressive Quantization (UPQ), a novel progressive quantization framework (FP16$\rightarrow$INT4$\rightarrow$INT2) that unifies block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for INT2 instruction-tuned LLM quantization. UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ to significantly reduce the quantization error introduced by subsequent INT2 quantization. Next, UPQ applies Distill-QAT to enable INT2 instruction-tuned LLMs to generate responses consistent with their original FP16 counterparts by minimizing the generalized Jensen-Shannon divergence (JSD) between the two. To the best of our knowledge, we are the first to demonstrate that UPQ can quantize open-source instruction-tuned LLMs to INT2 without relying on proprietary post-training data, while achieving state-of-the-art performance on MMLU and IFEval, two of the most representative benchmarks for evaluating instruction-tuned LLMs.
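The Distill-QAT stage described above trains the INT2 student to reproduce its FP16 teacher's token distributions under a generalized Jensen-Shannon divergence. Below is a minimal PyTorch sketch of such a loss over next-token logits; the function name, the interpolation coefficient `beta`, and the per-token averaging are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch: generalized JSD between an INT2 student and its FP16 teacher.
# JSD_beta(P || Q) = beta * KL(P || M) + (1 - beta) * KL(Q || M),
# with M = beta * P + (1 - beta) * Q (P = teacher, Q = student).
import torch
import torch.nn.functional as F

def generalized_jsd_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         beta: float = 0.5) -> torch.Tensor:
    vocab = teacher_logits.size(-1)
    # Flatten to [num_tokens, vocab] so "batchmean" averages over token positions.
    p = F.softmax(teacher_logits, dim=-1).reshape(-1, vocab)   # teacher distribution P
    q = F.softmax(student_logits, dim=-1).reshape(-1, vocab)   # student distribution Q
    m = beta * p + (1.0 - beta) * q                            # mixture M
    log_m = m.clamp_min(1e-12).log()
    kl_p_m = F.kl_div(log_m, p, reduction="batchmean")         # KL(P || M)
    kl_q_m = F.kl_div(log_m, q, reduction="batchmean")         # KL(Q || M)
    return beta * kl_p_m + (1.0 - beta) * kl_q_m

# Usage example with [batch, seq_len, vocab] logits (hypothetical shapes).
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
loss = generalized_jsd_loss(student, teacher)
loss.backward()
```

With `beta = 0.5` this reduces to the symmetric JSD; pushing `beta` toward 0 or 1 recovers the forward or reverse KL limits, which is the usual motivation for the generalized form in distillation.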
Related papers
- Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation [40.140261007984215]
We improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications.
arXiv Detail & Related papers (2026-01-30T10:39:11Z) - What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study [59.44848132298657]
Post-training quantization (PTQ) usually comes at the cost of large accuracy drops, especially for reasoning tasks under low-bit settings. In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models.
arXiv Detail & Related papers (2026-01-21T11:22:29Z) - R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization [20.861971198175674]
Residual Refinement Quantization (R2Q) is a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations; a minimal sketch of this two-step refinement appears after this list. R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings.
arXiv Detail & Related papers (2025-11-21T12:39:44Z) - Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or exceeds state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z) - Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. The new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z) - SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs [10.036727981085223]
SplitQuantV2 is an innovative algorithm designed to enhance low-bit linear quantization of large language models. It can achieve results comparable to those of advanced algorithms.
arXiv Detail & Related papers (2025-03-07T14:59:07Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - The Power of Negative Zero: Datatype Customization for Quantized Large Language Models [5.503925076208333]
Post-training quantization serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of large language models (LLMs). In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR). RaZeR remaps the negative zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit numerical distributions.
arXiv Detail & Related papers (2025-01-06T22:40:40Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss. We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks [31.431016659268206]
This study examines 4-bit quantization methods like GPTQ in large language models (LLMs).
We extend task scope to more generative categories such as code generation and abstractive summarization.
We propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization.
arXiv Detail & Related papers (2023-12-14T01:06:37Z) - CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a reconstruction scheme with cross-block dependency, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z) - Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [6.85331857224501]
Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability.
There are two mainstream quantization schemes for LLMs: coarse-grained (e.g., channel-wise) quantization and fine-grained (e.g., group-wise) quantization.
We introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLM that maintains superior performance while ensuring fast inference speed.
arXiv Detail & Related papers (2023-10-07T14:50:28Z) - Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs, and propose module-wise reconstruction error minimization (MREM), an efficient solution to mitigate these issues.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z) - BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction [29.040991149922615]
We study the challenging task of neural network quantization without end-to-end retraining, called Post-training Quantization (PTQ).
We propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time.
For the first time we prove that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 accuracy comparable with QAT, while enjoying 240 times faster production of quantized models.
arXiv Detail & Related papers (2021-02-10T13:46:16Z)
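As referenced in the R2Q entry above, the following is a hedged sketch of building a 2-bit quantizer from two sequential 1-bit sub-quantizations, where the second step quantizes the residual left by the first. The sign-times-mean-absolute-value scaling and the function names are illustrative assumptions, not R2Q's exact formulation.

```python
# Sketch of residual-refinement 2-bit quantization: two sequential 1-bit steps.
import torch

def one_bit_quantize(x: torch.Tensor) -> torch.Tensor:
    """1-bit quantization: sign(x) scaled by the mean absolute value (assumed scale rule)."""
    scale = x.abs().mean()
    return torch.sign(x) * scale

def residual_refinement_quantize(w: torch.Tensor) -> torch.Tensor:
    """Approximate a 2-bit code as the sum of two 1-bit sub-quantizations."""
    q1 = one_bit_quantize(w)          # coarse 1-bit approximation of the weights
    q2 = one_bit_quantize(w - q1)     # 1-bit refinement of the remaining residual
    return q1 + q2                    # at most four distinct levels per tensor

w = torch.randn(1024, 1024)
w_q = residual_refinement_quantize(w)
print("mean squared quantization error:", torch.mean((w - w_q) ** 2).item())
```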