Towards Accurate Post-training Quantization for Reparameterized Models
- URL: http://arxiv.org/abs/2402.16121v1
- Date: Sun, 25 Feb 2024 15:42:12 GMT
- Title: Towards Accurate Post-training Quantization for Reparameterized Models
- Authors: Luoming Zhang, Yefei He, Wen Fei, Zhenyu Lou, Weijia Wu, YangWei Ying,
and Hong Zhou
- Abstract summary: Current Post-training Quantization (PTQ) methods often lead to significant accuracy degradation.
This is primarily caused by channel-specific and sample-specific outliers.
We propose RepAPQ, a novel framework that preserves the accuracy of quantized reparameterized models.
- Score: 6.158896686945439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model reparameterization is a widely accepted technique for improving
inference speed without compromising performance. However, current
Post-training Quantization (PTQ) methods often lead to significant accuracy
degradation when applied to reparameterized models. This is primarily caused by
channel-specific and sample-specific outliers, which appear only in specific
samples and channels and affect the selection of quantization parameters. To
address this issue, we propose RepAPQ, a novel framework that preserves the
accuracy of quantized reparameterized models. Unlike previous frameworks that
use Mean Squared Error (MSE) as the measurement, we utilize Mean Absolute Error
(MAE) to mitigate the influence of outliers on the quantization parameters. Our
framework comprises two main components: Quantization
Protecting Reparameterization and Across-block Calibration. For effective
calibration, Quantization Protecting Reparameterization combines multiple
branches into a single convolution with an affine layer. During training, the
affine layer accelerates convergence and amplifies the output of the
convolution to better accommodate samples with outliers. Additionally,
Across-block Calibration leverages the measurement of stage output as
supervision to address the gradient problem introduced by MAE and enhance the
interlayer correlation with quantization parameters. Comprehensive experiments
demonstrate the effectiveness of RepAPQ across various models and tasks. Our
framework outperforms previous methods by approximately 1% for 8-bit PTQ and
2% for 6-bit PTQ, showcasing its superior performance. The code is available
at https://github.com/ilur98/DLMC-QUANT.
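The role of the MAE measurement can be illustrated with a small, self-contained sketch. This is not the authors' implementation (RepAPQ calibrates block outputs with Across-block Calibration and a protecting affine layer); it is only an assumed per-tensor step-size search with hypothetical helper names (fake_quant, search_scale), showing why an MAE criterion is less sensitive than MSE to a rare outlier when choosing quantization parameters:

```python
import numpy as np

def fake_quant(x, scale, num_bits=8):
    """Uniform symmetric quantization followed by dequantization."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def search_scale(x, num_bits=8, metric="mae", num_candidates=200):
    """Grid-search a per-tensor step size that minimizes the chosen error."""
    max_scale = np.abs(x).max() / (2 ** (num_bits - 1) - 1)
    best_scale, best_err = max_scale, np.inf
    for ratio in np.linspace(0.1, 1.0, num_candidates):
        scale = max_scale * ratio
        diff = x - fake_quant(x, scale, num_bits)
        err = np.abs(diff).mean() if metric == "mae" else np.square(diff).mean()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

rng = np.random.default_rng(0)
acts = rng.normal(size=4096)  # well-behaved bulk of a calibration tensor
acts[0] = 40.0                # a single sample-specific outlier

for metric in ("mse", "mae"):
    print(metric, "scale:", round(search_scale(acts, metric=metric), 4))
# The MSE search stretches the step size to cover the lone outlier, coarsening
# the grid for every other value; the MAE search keeps a step size close to
# what the bulk of the distribution needs.
```

In this toy setting the MAE-selected scale tracks the bulk of the values rather than the outlier, which is the effect the abstract attributes to replacing MSE with MAE as the calibration measurement.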
Related papers
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, and a wide range of learning rates and model sizes.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
Low-bit quantization is known to degrade the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z) - OAC: Output-adaptive Calibration for Accurate Post-training Quantization [30.115888331426515]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs).
Most PTQ approaches formulate the quantization error based on a calibrated layer-wise $\ell_2$ loss.
We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z) - PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
Non-quantized elementwise operations dominate the inference cost of low-precision models.
The PikeLPN model addresses this issue by applying quantization to both elementwise operations and multiply-accumulate operations.
arXiv Detail & Related papers (2024-03-29T18:23:34Z) - Post-Training Quantization for Re-parameterization via Coarse & Fine
Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of the weights.
We develop an improved KL metric to determine optimal quantization scales for activations.
For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
arXiv Detail & Related papers (2023-12-17T02:31:20Z) - Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model [81.55141188169621]
We equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios.
We propose an intra-block enhancement module, which introduces a linear projection head whose weights are generated from a hyper-complex layer.
Our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.
arXiv Detail & Related papers (2023-11-28T11:23:34Z) - Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision
Post-Training Quantization [7.392278887917975]
We propose a mixed-precision post-training quantization approach that assigns different numerical precisions to tensors in a network based on their specific needs.
Our experiments demonstrate latency reductions of 25.48%, 21.69%, and 33.28%, respectively, compared to a 16-bit baseline.
arXiv Detail & Related papers (2023-06-08T02:18:58Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language
Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - PD-Quant: Post-Training Quantization based on Prediction Difference
Metric [43.81334288840746]
Post-training quantization (PTQ) is a neural network compression technique that converts a full-precision model into a quantized model using lower-precision data types.
Determining appropriate quantization parameters remains the main open problem.
PD-Quant addresses this by considering global information, using the prediction difference between the full-precision and quantized models to determine the quantization parameters.
arXiv Detail & Related papers (2022-12-14T05:48:58Z) - Optimization of Annealed Importance Sampling Hyperparameters [77.34726150561087]
Annealed Importance Sampling (AIS) is a popular algorithm used to estimate the intractable marginal likelihood of deep generative models.
We present a parametric AIS process with flexible intermediary distributions and optimize the bridging distributions to use fewer sampling steps.
We assess the performance of our optimized AIS for marginal likelihood estimation of deep generative models and compare it to other estimators.
arXiv Detail & Related papers (2022-09-27T07:58:25Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)