Softmax Bias Correction for Quantized Generative Models
- URL: http://arxiv.org/abs/2309.01729v1
- Date: Mon, 4 Sep 2023 17:29:31 GMT
- Title: Softmax Bias Correction for Quantized Generative Models
- Authors: Nilesh Prasad Pandey, Marios Fournarakis, Chirag Patel, Markus Nagel
- Abstract summary: Post-training quantization (PTQ) is the go-to compression technique for large generative models, such as stable diffusion or large language models.
PTQ methods commonly keep the softmax activation in higher precision as it has been shown to be very sensitive to quantization noise.
This can lead to a significant runtime and power overhead during inference on resource-constrained edge devices.
We propose an offline bias correction technique that improves the quantizability of softmax without additional compute during deployment.
- Score: 8.953308552614438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training quantization (PTQ) is the go-to compression technique for large
generative models, such as stable diffusion or large language models. PTQ
methods commonly keep the softmax activation in higher precision as it has been
shown to be very sensitive to quantization noise. However, this can lead to a
significant runtime and power overhead during inference on resource-constrained
edge devices. In this work, we investigate the source of the softmax
sensitivity to quantization and show that the quantization operation leads to a
large bias in the softmax output, causing accuracy degradation. To overcome
this issue, we propose an offline bias correction technique that improves the
quantizability of softmax without additional compute during deployment, as it
can be readily absorbed into the quantization parameters. We demonstrate the
effectiveness of our method on stable diffusion v1.5 and the 125M-parameter OPT language
model, achieving significant accuracy improvement for 8-bit quantized softmax.
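To make the idea above concrete, the following is a minimal NumPy sketch of the general recipe the abstract describes: estimate, on a small calibration set, the average bias that quantization adds to the softmax output, then fold that constant correction into the dequantization offset so no extra compute is needed at inference time. The shapes, quantization settings, and helper names (`fake_quantize`, `estimate_softmax_bias`, `calib_logits`) are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fake_quantize(p, scale, offset, n_bits=8):
    """Uniform affine quantize-dequantize of softmax probabilities."""
    q = np.clip(np.round(p / scale) + offset, 0, 2 ** n_bits - 1)
    return (q - offset) * scale

def estimate_softmax_bias(calib_logits, scale, offset, n_bits=8):
    """Offline: mean shift that quantization adds to the softmax output,
    measured on a small calibration set (hypothetical helper)."""
    p_fp = softmax(calib_logits)
    p_q = fake_quantize(p_fp, scale, offset, n_bits)
    return (p_q - p_fp).mean()

# Assumed calibration logits and an 8-bit grid over [0, 1].
rng = np.random.default_rng(0)
calib_logits = rng.normal(size=(512, 64))
scale, offset = 1.0 / 255.0, 0.0
bias = estimate_softmax_bias(calib_logits, scale, offset)

# Fold the constant correction into the dequantization offset so no extra
# ops run at inference: (q - offset) * scale - bias == (q - corrected) * scale.
corrected_offset = offset + bias / scale
```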
Related papers
- 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
Low-bit quantization is known to degrade the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z) - QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning [52.157939524815866]
In this paper, we empirically unravel three properties in quantized diffusion models that compromise the efficacy of current methods.
We identify two critical types of quantized layers: those holding vital temporal information and those sensitive to reduced bit-width.
Our method is evaluated over three high-resolution image generation tasks and achieves state-of-the-art performance under various bit-width settings.
arXiv Detail & Related papers (2024-02-06T03:39:44Z) - Post-training Quantization for Text-to-Image Diffusion Models with Progressive Calibration and Activation Relaxing [49.800746112114375]
We propose a novel post-training quantization method (Progressive and Relaxing) for text-to-image diffusion models.
We are the first to achieve quantization for Stable Diffusion XL while maintaining performance.
arXiv Detail & Related papers (2023-11-10T09:10:09Z) - MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search [7.564770908909927]
Quantization is a technique for creating efficient Deep Neural Networks (DNNs).
We propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error (a simplified sketch of this idea appears after this list).
We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone.
arXiv Detail & Related papers (2023-09-29T15:49:54Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - Towards Accurate Post-Training Quantization for Vision Transformer [48.779346466374406]
Existing post-training quantization methods still cause severe performance drops.
APQ-ViT surpasses the existing post-training quantization methods by convincing margins.
arXiv Detail & Related papers (2023-03-25T03:05:26Z) - Density-Softmax: Efficient Test-time Model for Uncertainty Estimation and Robustness under Distribution Shifts [8.431465371266391]
Density-Softmax is a sampling-free deterministic framework for uncertainty estimation.
We show that our model is the solution of minimax uncertainty risk.
Our method enjoys competitive results with state-of-the-art techniques in terms of uncertainty and robustness.
arXiv Detail & Related papers (2023-02-13T16:21:03Z) - Q-Diffusion: Quantizing Diffusion Models [52.978047249670276]
Post-training quantization (PTQ) is considered a go-to compression method for tasks other than diffusion-based generation.
We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture.
We show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance.
arXiv Detail & Related papers (2023-02-08T19:38:59Z) - Q-Rater: Non-Convex Optimization for Post-Training Uniform Quantization [9.062897838978955]
Various post-training uniform quantization methods have usually been based on convex optimization.
Our proposed technique yields higher model accuracy, especially at low quantization bit-widths.
arXiv Detail & Related papers (2021-05-05T05:14:22Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator (a minimal sketch of this recipe appears after this list).
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
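Below is the simplified sketch referenced in the MixQuant entry above: a per-layer weight bit-width is chosen from its roundoff (quantization) error. The concrete selection rule used here (smallest candidate bit-width whose error stays under a tolerance) and the helper names (`roundoff_error`, `select_bitwidths`) are assumptions for illustration, not the paper's actual search algorithm.

```python
import numpy as np

def roundoff_error(w, n_bits):
    """Mean squared error of symmetric uniform quantization at n_bits."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return float(np.mean((w - w_q) ** 2))

def select_bitwidths(layer_weights, candidates=(2, 4, 8), tolerance=1e-4):
    """Pick, per layer, the smallest candidate bit-width whose roundoff
    error stays below `tolerance`; fall back to the largest candidate."""
    choice = {}
    for name, w in layer_weights.items():
        choice[name] = max(candidates)
        for b in sorted(candidates):
            if roundoff_error(w, b) <= tolerance:
                choice[name] = b
                break
    return choice

# Example with random stand-in weights for two layers.
rng = np.random.default_rng(0)
layers = {"fc1": rng.normal(scale=0.02, size=(256, 256)),
          "fc2": rng.normal(scale=0.5, size=(256, 256))}
print(select_bitwidths(layers))
```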
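And here is the minimal sketch referenced in the Quant-Noise entry above, illustrating the standard recipe that summary describes: weights are fake-quantized in the forward pass while gradients bypass the rounding via the Straight-Through Estimator. This is a generic PyTorch illustration of QAT with STE, not the Quant-Noise method itself; `FakeQuantSTE` and `QuantLinear` are names introduced here.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric 8-bit fake quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, w, scale):
        q = torch.clamp(torch.round(w / scale), -128, 127)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round/clamp as identity and pass the gradient through.
        return grad_output, None

class QuantLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized during training."""

    def forward(self, x):
        scale = self.weight.detach().abs().max() / 127.0  # per-tensor scale
        w_q = FakeQuantSTE.apply(self.weight, scale)
        return torch.nn.functional.linear(x, w_q, self.bias)

# Usage: drop-in replacement during QAT fine-tuning.
layer = QuantLinear(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()  # gradients reach layer.weight via the STE
```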
This list is automatically generated from the titles and abstracts of the papers on this site.