Related papers: ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

URL: http://arxiv.org/abs/2511.10645v1
Date: Fri, 14 Nov 2025 02:01:06 GMT
Title: ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
Authors: Yesheng Liang, Haisheng Chen, Song Han, Zhijian Liu
Abstract summary: Post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. The presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation. We propose Pairwise Rotation Quantization (ParoQuant) to suppress outliers without introducing significant overhead during inference. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead.
Score: 13.283581083797484
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
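As a rough illustration of the idea in the abstract, the sketch below rotates one pair of input channels with a single Givens rotation and then applies a generic round-to-nearest group quantizer. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the toy matrix, the fixed angle of pi/4, and the quantizer are all hypothetical, and the paper's optimized rotations, channel-wise scaling, and GPU kernel are not reproduced here.

```python
import numpy as np

def givens_rotate_pair(M, i, j, theta):
    """Rotate columns i and j of M by angle theta (one Givens rotation)."""
    c, s = np.cos(theta), np.sin(theta)
    out = M.copy()
    out[:, i] = c * M[:, i] - s * M[:, j]
    out[:, j] = s * M[:, i] + c * M[:, j]
    return out

def quantize_groups(W, bits=4, group_size=4):
    """Generic asymmetric round-to-nearest quantizer per group of columns.

    Returns the dequantized weights so the rounding error can be inspected.
    """
    rows, cols = W.shape
    g = W.reshape(rows, cols // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / (2**bits - 1)
    q = np.round((g - lo) / scale)
    return (q * scale + lo).reshape(rows, cols)

# Toy weight matrix (out_features x in_features); column 0 is an outlier channel.
W = np.array([[10.0, 0.0, 0.5, -0.5],
              [-8.0, 0.2, 0.3, 0.1]])
theta = np.pi / 4  # hypothetical fixed angle, chosen only for illustration

# Rotating channels 0 and 1 spreads the outlier magnitude over both channels.
R = givens_rotate_pair(W, 0, 1, theta)

# Applying the same rotation to the activations keeps the layer output
# unchanged, because the rotation is orthogonal.
x = np.array([[1.0, 2.0, 3.0, 4.0]])
x_rot = givens_rotate_pair(x, 0, 1, theta)

deq = quantize_groups(R)
```

In this toy case the largest absolute weight shrinks after the rotation while `x @ W.T` and `x_rot @ R.T` agree, which is the general property that lets a rotation be folded into the weights and undone on the activations at runtime.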

Related papers

D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs [33.883527341335856]
Weight-only post-training quantization (PTQ) is appealing as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision. We propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives.
arXiv Detail & Related papers (2026-01-30T05:49:48Z)
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models [41.677469535447024]
Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. Post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration. Recent advances in post-training quantization have demonstrated that even sub-4-bit methods can maintain most of the original model performance.
arXiv Detail & Related papers (2025-12-25T12:39:36Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment [15.802372921412198]
We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. We first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance.
arXiv Detail & Related papers (2025-09-24T15:10:44Z)
MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models [25.783531928577233]
BASE-Q is a simple yet powerful approach that combines bias correction and asymmetric scaling to reduce rounding and clipping errors. Experiments demonstrate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5%, 42.9%, and 29.2% compared to QuaRot, SpinQuant, and OSTQuant, respectively.
arXiv Detail & Related papers (2025-05-26T14:22:21Z)
PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub-2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
OAC: Output-adaptive Calibration for Accurate Post-training Quantization [28.67781845829386]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs). Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z)
CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.