Related papers: R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization

URL: http://arxiv.org/abs/2511.21736v1
Date: Fri, 21 Nov 2025 12:39:44 GMT
Title: R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization
Authors: Jiayi Chen, Jieqi Shi, Jing Huo, Chen Wu
Abstract summary: Residual Refinement Quantization (R2Q) is a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations. R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings.
Score: 20.861971198175674
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid progress of Large Language Models (LLMs) has brought substantial computational and memory demands, spurring the adoption of low-bit quantization. While 8-bit and 4-bit formats have become prevalent, extending quantization to 2 bits remains challenging due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice. Extensive evaluations on Llama, OPT, and Qwen across diverse benchmarks (covering question answering, commonsense reasoning, and language modeling) demonstrate that R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings. By refining quantization through a residual learning mechanism, R2Q enhances performance, improves training stability, and accelerates convergence under extreme compression. Furthermore, its modular design enables seamless integration with existing quantization-aware training (QAT) frameworks.
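Illustration: the abstract describes splitting 2-bit quantization into two sequential 1-bit steps, with the second step refining the residual of the first. A minimal NumPy sketch of that general idea follows. It is not the authors' implementation: the abstract does not say how the scales are chosen or learned, so the mean-absolute-value scale used here is an assumption, and the names `binarize` and `residual_2bit` are hypothetical.

```python
import numpy as np


def binarize(x):
    """1-bit quantization: sign pattern plus one scalar scale.

    Assumption: the scale is the mean absolute value, which is the
    least-squares optimal scale for a fixed sign pattern.
    """
    alpha = np.mean(np.abs(x))
    signs = np.where(x >= 0, 1.0, -1.0)
    return alpha, signs


def residual_2bit(w):
    """Approximate w with two sequential 1-bit quantizations."""
    a1, b1 = binarize(w)            # first 1-bit pass on the weights
    residual = w - a1 * b1          # what the first pass missed
    a2, b2 = binarize(residual)     # second 1-bit pass on the residual
    return a1 * b1 + a2 * b2, (a1, a2)


rng = np.random.default_rng(0)
w = rng.normal(size=1024)           # stand-in for one weight group

w_hat, (a1, a2) = residual_2bit(w)
levels = np.unique(w_hat)           # at most four values: +/-a1 +/- a2

err_1bit = np.mean((w - a1 * np.where(w >= 0, 1.0, -1.0)) ** 2)
err_2bit = np.mean((w - w_hat) ** 2)
print("levels:", levels)
print("1-bit MSE:", err_1bit, "2-bit MSE:", err_2bit)
```

Under this assumed scaling, the four reconstruction levels are set by the two scales rather than by a fixed uniform grid, and the second pass lowers the mean squared error by exactly the square of the second scale.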

Related papers

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
arXiv Detail & Related papers (2026-02-04T02:54:37Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
Q-SAM2: Accurate Quantization for Segment Anything Model 2 [19.438737615421598]
We propose an accurate low-bit quantization method for efficient Segment Anything Model 2 (SAM2). Q-SAM2 addresses the performance degradation caused by the singularities in weight and activation distributions during quantization. Our experiments demonstrate that Q-SAM2 allows for highly accurate inference while substantially improving efficiency.
arXiv Detail & Related papers (2025-06-11T14:21:38Z)
Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information of up to 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment. It is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts. We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z)
RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization [8.827794405944637]
Post-training quantization (PTQ) is a promising solution for compressing large transformer models. Existing PTQ methods typically exhibit non-trivial performance loss. We propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm.
arXiv Detail & Related papers (2024-02-08T12:35:41Z)
CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z)
MRQ: Support Multiple Quantization Schemes through Model Re-Quantization [0.17499351967216337]
Deep learning models cannot be easily quantized for diverse fixed-point hardware. A new type of model quantization approach, called model re-quantization, is proposed. Models obtained from the re-quantization process have been successfully deployed on NNA in the Echo Show devices.
arXiv Detail & Related papers (2023-08-01T08:15:30Z)
Training Multi-bit Quantized and Binarized Networks with A Learnable Symmetric Quantizer [1.9659095632676098]
Quantizing weights and activations of deep neural networks is essential for deploying them in resource-constrained devices or cloud platforms. While binarization is a special case of quantization, this extreme case often leads to several training difficulties. We develop a unified quantization framework, denoted as UniQ, to overcome binarization difficulties.
arXiv Detail & Related papers (2021-04-01T02:33:31Z)
Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy. We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR. Our FQSR using low-bit quantization can achieve on-par performance compared with the full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)
Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of the two sides. We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models. Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.