SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
- URL: http://arxiv.org/abs/2512.04746v1
- Date: Thu, 04 Dec 2025 12:35:10 GMT
- Title: SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
- Authors: Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen
- Abstract summary: SignRoundV2 is a post-training quantization framework that is highly effective even without mixed-precision. Our method sustains competitive accuracy for Large Language Models, achieving production-grade performance with about 1 percent variance at 4-5 bits.
- Score: 4.946856266233001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2-bits and even 4-bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
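The two components in the abstract can be illustrated with a toy sketch: score each layer by a gradient-weighted quantization deviation, then greedily upgrade the most sensitive layers within a total bit budget. The function names and the greedy allocation below are illustrative assumptions, not SignRoundV2's actual implementation:

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Uniform symmetric round-to-nearest quantization (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def layer_sensitivity(w, grad, bits):
    """Combine gradient information with the quantization-induced
    deviation: a layer where large gradients coincide with large
    rounding errors is marked sensitive (an assumed form, not the
    paper's exact metric)."""
    deviation = quantize_dequantize(w, bits) - w
    return float(np.sum(np.abs(grad * deviation)))

def allocate_bits(layers, budget_bits, candidates=(2, 4)):
    """Greedy layer-wise bit allocation: start every layer at the low
    bit-width, then upgrade the most sensitive layers while the total
    bit budget allows."""
    low, high = min(candidates), max(candidates)
    alloc = {name: low for name in layers}
    scores = {name: layer_sensitivity(w, g, low)
              for name, (w, g) in layers.items()}
    spent = low * len(layers)
    for name in sorted(scores, key=scores.get, reverse=True):
        if spent + (high - low) <= budget_bits:
            alloc[name] = high
            spent += high - low
    return alloc
```

Here `layers` maps layer names to `(weight, gradient)` arrays; in practice the gradients would come from a few calibration batches.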
Related papers
- AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization [7.413057271242686]
Quantization, particularly floating-point quantization, is known to be capable of speeding up large language model (LLM) inference. We propose AMS-Quant, which extends floating-point quantization from integer bit-widths to non-integer bit-widths. We show that AMS-Quant can quantize the model to FP-5.33-e2m3 and FP-4.25-e2m2, and significantly speed up decoding over FP16 inference.
arXiv Detail & Related papers (2025-10-16T15:37:23Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information that can reach 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization [73.60493264901359]
We present a unified framework for rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. We show that ternary, 2-bit, and 3-bit quantization maintain comparable performance in the size-accuracy trade-off. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
arXiv Detail & Related papers (2025-02-04T18:59:26Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
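As a rough illustration of the 4-bit casting that QUIK-style schemes rely on, here is a generic symmetric round-to-nearest sketch (not QUIK's actual hybrid algorithm, which additionally separates hard-to-quantize channels):

```python
import numpy as np

def quantize_4bit(x):
    """Symmetric per-tensor 4-bit quantization: floats become integers
    in [-8, 7] plus one float scale (a generic sketch)."""
    qmax = 7
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -8, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the 4-bit integers back to floats."""
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.2, 3.5, -3.5], dtype=np.float32)
q, scale = quantize_4bit(x)
x_hat = dequantize(q, scale)  # round-trip error is at most scale / 2
```

In a real kernel the int4 values would be packed two per byte; the point here is only the integer grid and single scale that make the matmuls cheap.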
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs [16.596819845726625]
SignRound is a method that leverages signed gradient descent (SignSGD) to optimize rounding values and weight clipping in just 200 steps.
It delivers exceptional results across 2 to 4 bits while minimizing tuning costs and avoiding additional inference overhead.
It also demonstrates strong generalization in recent models, achieving near-lossless 4-bit quantization in most scenarios.
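The signed-gradient rounding idea can be sketched as learning a small per-weight offset on each rounding decision, updated by the sign of the gradient. This is a toy single-layer version under assumed names; the paper tunes rounding values and weight clipping jointly, while this sketch uses a straight-through gradient estimate:

```python
import numpy as np

def signround_toy(w, x, bits=4, steps=200, lr=0.01):
    """Learn a per-weight rounding offset v in [-0.5, 0.5] with SignSGD,
    minimizing the layer-output error ||x (w_q - w)||^2. The gradient
    through the floor is taken straight-through (treated as identity)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax

    def dequant(v):
        return np.clip(np.floor(w / scale + 0.5 + v), -qmax - 1, qmax) * scale

    v = np.zeros_like(w)
    for _ in range(steps):
        w_q = dequant(v)
        g = 2.0 * (x.T @ (x @ (w_q - w))) * scale  # straight-through grad
        v = np.clip(v - lr * np.sign(g), -0.5, 0.5)  # SignSGD step
    return dequant(v)
```

Because only the sign of the gradient is used, the update is cheap and robust to gradient scale, which is part of why so few tuning steps suffice.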
arXiv Detail & Related papers (2023-09-11T14:58:23Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression at ultra-low precisions down to 3 bits.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
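The Dense-and-Sparse idea can be illustrated by splitting the largest-magnitude outliers into a full-precision sparse part and quantizing only the dense remainder; this is a simplified sketch with assumed names, and it uses a uniform grid where SqueezeLLM uses non-uniform, sensitivity-based grids:

```python
import numpy as np

def dense_sparse_split(w, outlier_frac=0.005, bits=3):
    """Keep the top `outlier_frac` of weights (by magnitude) in full
    precision as the sparse part; quantize the dense remainder to
    `bits` with a uniform symmetric grid (simplified sketch)."""
    k = max(1, int(outlier_frac * w.size))
    thresh = np.partition(np.abs(w).ravel(), -k)[-k]
    sparse_mask = np.abs(w) >= thresh
    dense = np.where(sparse_mask, 0.0, w)

    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(dense))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dense_q = np.clip(np.round(dense / scale), -qmax - 1, qmax) * scale

    # reconstruction = quantized dense part + exact sparse outliers
    return np.where(sparse_mask, w, dense_q)
```

Removing a handful of outliers shrinks the dynamic range of the dense part, so the same bit budget yields a much finer quantization step for the vast majority of weights.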
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - PalQuant: Accelerating High-precision Networks on Low-precision Accelerators [17.877271678887315]
Low-precision deep learning accelerators (DLAs) have become popular due to their advantages in chip area and energy consumption.
One way to achieve both high accuracy and efficient inference is to deploy high-precision neural networks on low-precision DLAs.
We propose the PArallel Low-precision Quantization (PalQuant) method that approximates high-precision computations via learning parallel low-precision representations from scratch.
arXiv Detail & Related papers (2022-08-03T09:44:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.