PoTPTQ: A Two-step Power-of-Two Post-training for LLMs
- URL: http://arxiv.org/abs/2507.11959v1
- Date: Wed, 16 Jul 2025 06:44:14 GMT
- Title: PoTPTQ: A Two-step Power-of-Two Post-training for LLMs
- Authors: Xinyu Wang, Vahid Partovi Nia, Peng Lu, Jerry Huang, Xiao-Wen Chang, Boxing Chen, Yufei Cui
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks, but their deployment demands substantial computational resources. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization.
- Score: 27.141872509108122
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Although previous PoT quantization schemes can be dequantized efficiently on CPUs using fixed-point addition, they are less effective on GPUs because dequantization entangles the sign bit and requires sequential bit manipulations. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. The performance of our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization accelerates the dequantization step required for floating-point inference, yielding a $3.67\times$ speedup on an NVIDIA V100 and $1.63\times$ on an NVIDIA RTX 4090 compared to uniform integer dequantization.
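
To make the scheme concrete, below is a minimal NumPy sketch of the two ingredients the abstract describes: power-of-two weight quantization/dequantization and a small calibration-based scale refinement. The exponent grid, the max-abs scale initialization, the grid-search refinement, and all function names are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def pot_quantize(w, n_bits=3, scale=None):
    # Map each weight to sign * scale * 2^(-e), where e uses the remaining
    # n_bits - 1 bits (one bit is spent on the sign). The grid and the
    # max-abs scale initialization are illustrative, not the paper's recipe.
    n_exponents = 2 ** (n_bits - 1)
    if scale is None:
        scale = float(np.abs(w).max())       # simple starting point for the scale
    sign = np.sign(w).astype(np.int8)
    mag = np.clip(np.abs(w) / scale, 2.0 ** -(n_exponents - 1), 1.0)
    exponent = np.clip(np.round(-np.log2(mag)), 0, n_exponents - 1).astype(np.int8)
    return sign, exponent, scale

def pot_dequantize(sign, exponent, scale):
    # Reconstruction is a sign flip plus a power-of-two scaling of the scale.
    return sign * scale * np.exp2(-exponent.astype(np.float32))

def refine_scale(w, x_calib, n_bits=3, candidates=np.linspace(0.5, 1.0, 21)):
    # Step 2 (sketch): search for a scale multiplier that minimizes the layer
    # output error on a small calibration batch, instead of weight error alone.
    base = float(np.abs(w).max())
    best_scale, best_err = base, np.inf
    for c in candidates:
        s, e, sc = pot_quantize(w, n_bits, scale=c * base)
        err = np.linalg.norm(x_calib @ w.T - x_calib @ pot_dequantize(s, e, sc).T)
        if err < best_err:
            best_scale, best_err = c * base, err
    return best_scale

# Toy usage on random data.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
x = rng.standard_normal((8, 64)).astype(np.float32)
scale = refine_scale(w, x)
sign, exp, _ = pot_quantize(w, scale=scale)
print("mean abs weight error:", np.abs(w - pot_dequantize(sign, exp, scale)).mean())
```

A real implementation would use per-group or per-channel scales and a proper calibration objective; this per-tensor grid search is only meant to show where the calibration set enters. The appeal of the PoT format is visible in `pot_dequantize`: reconstruction is a sign flip plus a power-of-two scaling, which can be folded into exponent arithmetic rather than a full integer-to-float multiply.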
Related papers
- ICQuant: Index Coding enables Low-bit LLM Quantization [11.57957118744944]
A key challenge in weight quantization is the presence of outliers, which inflate quantization ranges and lead to large errors. We present ICQuant, a novel framework that leverages outlier statistics to design an efficient index coding scheme for outlier-aware quantization. Using just 2.3 bits per weight and simple scalar quantizers, ICQuant improves the zero-shot accuracy of the 2-bit Llama3-70B model by up to 130% and 150% relative to QTIP and QuIP#.
arXiv Detail & Related papers (2025-05-01T20:23:29Z)
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
- PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization [44.547992997369875]
We propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache. Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error.
arXiv Detail & Related papers (2024-10-07T17:59:35Z)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss. We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
- GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that minimizing the quantization error of weights is ineffective, leading to overfitting.
GPTQT employs a progressive two-step approach: it first quantizes weights to a relatively high bit width using linear quantization, and then converts the obtained integer weights to a lower-bit binary coding.
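As a rough illustration of such a progressive two-step scheme (not GPTQT's actual algorithm; the second stage below is a generic greedy binary coding, and all names are hypothetical):

```python
import numpy as np

def linear_quantize(w, bits=8):
    # Stage 1: plain symmetric linear quantization to a relatively high bit width.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q, scale

def binary_code(q, k=3):
    # Stage 2: greedily re-express the integer weights as a sum of k binary
    # (+1/-1) components with shared coefficients -- one generic form of
    # "binary coding"; the exact coding used by GPTQT may differ.
    residual = q.astype(np.float32)
    bases, coeffs = [], []
    for _ in range(k):
        b = np.where(residual >= 0, 1.0, -1.0)
        c = np.abs(residual).mean()
        bases.append(b)
        coeffs.append(c)
        residual -= c * b
    return np.stack(bases), np.array(coeffs)

# Toy usage: two-step quantization and reconstruction error.
w = np.random.randn(64).astype(np.float32)
q, scale = linear_quantize(w)
B, c = binary_code(q, k=3)
w_hat = scale * (c[:, None] * B).sum(axis=0)
print("mean abs error:", np.abs(w - w_hat).mean())
```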
arXiv Detail & Related papers (2024-07-03T08:08:01Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization (OmniQuant) technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
For single-batch generative inference with LLMs, the main bottleneck is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions as low as 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks [1.398698203665363]
In this paper, we explore non-linear quantization techniques for exploiting lower bit precision.
We developed a Quantization-Aware Training (QAT) algorithm that allows training of low-bit-width Power-of-Two (PoT) networks.
At the same time, PoT quantization vastly reduces the computational complexity of the neural network.
arXiv Detail & Related papers (2022-03-09T19:57:14Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization [0.0]
Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware.
However, PoT quantization triggers a severe accuracy drop because of its limited representation ability.
We propose an efficient PoT quantization scheme that balances accuracy and costs in a memory-efficient way.
arXiv Detail & Related papers (2021-03-22T10:13:12Z)