Improving Quantization-aware Training of Low-Precision Network via Block Replacement on Full-Precision Counterpart
- URL: http://arxiv.org/abs/2412.15846v1
- Date: Fri, 20 Dec 2024 12:38:18 GMT
- Title: Improving Quantization-aware Training of Low-Precision Network via Block Replacement on Full-Precision Counterpart
- Authors: Chengting Yu, Shu Yang, Fengzhao Zhang, Hanzhi Ma, Aili Wang, Er-Ping Li
- Abstract summary: Quantization-aware training (QAT) is a common paradigm for network quantization.
The low-precision model exhibits limited representation capabilities and cannot directly replicate full-precision calculations.
We propose a general QAT framework that alleviates these concerns by permitting the forward and backward processes of the low-precision network to be guided by the full-precision partner.
- Score: 1.5508907979229383
- License:
- Abstract: Quantization-aware training (QAT) is a common paradigm for network quantization, in which the training phase incorporates the simulation of the low-precision computation to optimize the quantization parameters in alignment with the task goals. However, direct training of low-precision networks generally faces two obstacles: 1. The low-precision model exhibits limited representation capabilities and cannot directly replicate full-precision calculations, which constitutes a deficiency compared to full-precision alternatives; 2. Employing pseudo-gradients to approximate the derivatives of quantization functions introduces non-ideal deviations during gradient propagation. In this paper, we propose a general QAT framework that alleviates these concerns by permitting the forward and backward processes of the low-precision network to be guided by the full-precision partner during training. In conjunction with the direct training of the quantization model, intermediate mixed-precision models are generated through block-by-block replacement on the full-precision model and work simultaneously with the low-precision backbone, which enables the integration of quantized low-precision blocks into full-precision networks throughout the training phase. Consequently, each quantized block is capable of: 1. simulating full-precision representation during forward passes; 2. obtaining gradients with improved estimation during backward passes. We demonstrate that the proposed method achieves state-of-the-art results for 4-, 3-, and 2-bit quantization on ImageNet and CIFAR-10. The proposed framework provides a compatible extension for most QAT methods and only requires a concise wrapper around existing code.
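To make the described setup concrete, below is a minimal PyTorch-style sketch of the two ingredients the abstract mentions: fake quantization with a straight-through pseudo-gradient, and the construction of intermediate mixed-precision models by swapping individual quantized blocks into a full-precision counterpart. The names (FakeQuantize, QuantConv2d, build_mixed_precision_models) and the parameter-sharing details are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuantize(torch.autograd.Function):
    """Symmetric uniform fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, num_bits):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        return torch.round(x / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Pseudo-gradient: treat rounding as identity; num_bits gets no gradient.
        return grad_output, None


class QuantConv2d(nn.Conv2d):
    """Convolution whose weights are fake-quantized in the forward pass."""

    def __init__(self, *args, num_bits=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_bits = num_bits

    def forward(self, x):
        w_q = FakeQuantize.apply(self.weight, self.num_bits)
        return F.conv2d(x, w_q, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)


def build_mixed_precision_models(fp_blocks, q_blocks):
    """For each block index k, build a hybrid model whose k-th block is the
    quantized block (shared with the low-precision backbone) while every other
    block stays full-precision.  Each hybrid lets one quantized block see
    full-precision activations in the forward pass and receive gradients
    computed through full-precision neighbors in the backward pass."""
    hybrids = []
    for k in range(len(fp_blocks)):
        blocks = list(fp_blocks)      # full-precision blocks
        blocks[k] = q_blocks[k]       # swap in the k-th quantized block
        hybrids.append(nn.Sequential(*blocks))
    return hybrids
```

In such a setup, the low-precision backbone and the hybrids would be run on the same batch and their task losses combined, so every quantized block also receives guidance from a full-precision context; how the losses are weighted and which parameters are frozen are details left to the paper.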
Related papers
- GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for domain generalization (DG).
Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.
Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
Non-quantized elementwise operations dominate the inference cost of low-precision models.
The PikeLPN model addresses these issues by applying quantization to both elementwise operations and multiply-accumulate operations.
arXiv Detail & Related papers (2024-03-29T18:23:34Z) - Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models [88.80146574509195]
Quantization is a promising approach for reducing memory overhead and accelerating inference.
We propose a novel zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs.
arXiv Detail & Related papers (2023-10-20T07:09:56Z) - CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level
Continuous Sparsification [51.81850995661478]
Mixed-precision quantization has been widely applied to deep neural networks (DNNs).
Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence.
We propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability.
arXiv Detail & Related papers (2022-12-06T05:44:21Z) - Neural Networks with Quantization Constraints [111.42313650830248]
We present a constrained learning approach to quantization training.
We show that the resulting problem is strongly dual and does away with gradient estimations.
We demonstrate that the proposed approach exhibits competitive performance in image classification tasks.
arXiv Detail & Related papers (2022-10-27T17:12:48Z) - AMED: Automatic Mixed-Precision Quantization for Edge Devices [3.5223695602582614]
Quantized neural networks are well known for reducing latency, power consumption, and model size without significantly harming performance.
Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths.
arXiv Detail & Related papers (2022-05-30T21:23:22Z) - DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution
Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop at ultra-low precisions of 4 bits or fewer, or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z) - Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR with low-bit quantization achieves performance on par with the full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z) - QuantNet: Learning to Quantize by Learning within Fully Differentiable
Framework [32.465949985191635]
This paper proposes a meta-based quantizer named QuantNet, which utilizes a differentiable sub-network to directly binarize the full-precision weights.
Our method not only solves the problem of gradient mismatch, but also reduces the impact of discretization errors caused by the binarization operation during deployment.
arXiv Detail & Related papers (2020-09-10T01:41:05Z) - FracBits: Mixed Precision Quantization via Fractional Bit-Widths [29.72454879490227]
Mixed-precision quantization is favorable on customized hardware supporting arithmetic operations at multiple bit-widths.
We propose a novel learning-based algorithm to derive mixed-precision models end-to-end under target computation constraints; a sketch of the fractional bit-width idea follows this list.
arXiv Detail & Related papers (2020-07-04T06:09:09Z)
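As a brief illustration of the fractional bit-width idea from the FracBits entry above, the hypothetical sketch below relaxes an integer bit-width into a continuous, learnable parameter by interpolating between the quantizations at the two adjacent integer bit-widths. The function names and the exact quantizer are assumptions about the general mechanism, not the paper's formulation.

```python
import torch


def fake_quantize(x, num_bits):
    """Symmetric uniform fake quantization with a straight-through estimator,
    implemented via the detach trick (forward: quantized, backward: identity)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.round(x / scale).clamp(-qmax, qmax) * scale
    return x + (q - x).detach()


def frac_quantize(x, b):
    """Quantize x at a fractional bit-width b by linearly interpolating between
    the results at floor(b) and ceil(b) bits; the gradient with respect to b
    flows through the interpolation weights, making the bit-width learnable."""
    lo = int(torch.floor(b))
    frac = b - lo                           # differentiable fractional part
    q_lo = fake_quantize(x, lo)             # quantized at floor(b) bits
    q_hi = fake_quantize(x, lo + 1)         # quantized at ceil(b) bits
    return (1 - frac) * q_lo + frac * q_hi


# Example: a learnable bit-width for one layer's weights.
bit_width = torch.nn.Parameter(torch.tensor(3.4))
weights = torch.randn(64, 32, 3, 3, requires_grad=True)
w_q = frac_quantize(weights, bit_width)     # gradients reach both inputs
```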
This list is automatically generated from the titles and abstracts of the papers on this site.