Explaining How Quantization Disparately Skews a Model
- URL: http://arxiv.org/abs/2509.07222v1
- Date: Mon, 08 Sep 2025 21:04:16 GMT
- Title: Explaining How Quantization Disparately Skews a Model
- Authors: Abhimanyu Bellam, Jung-Eun Kim,
- Abstract summary: Post Training Quantization (PTQ) is widely adopted due to its high compression capacity and speed with minimal impact on accuracy.<n>We observed that disparate impacts are exacerbated by quantization, especially for minority groups.<n>We explore how the changes in weights and activations induced by quantization cause cascaded impacts in the network, resulting in logits with lower variance, increased loss, and compromised group accuracies.
- Score: 8.210473195536077
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Post Training Quantization (PTQ) is widely adopted due to its high compression capacity and speed with minimal impact on accuracy. However, we observed that disparate impacts are exacerbated by quantization, especially for minority groups. Our analysis explains that in the course of quantization there is a chain of factors attributed to a disparate impact across groups during forward and backward passes. We explore how the changes in weights and activations induced by quantization cause cascaded impacts in the network, resulting in logits with lower variance, increased loss, and compromised group accuracies. We extend our study to verify the influence of these impacts on group gradient norms and eigenvalues of the Hessian matrix, providing insights into the state of the network from an optimization point of view. To mitigate these effects, we propose integrating mixed precision Quantization Aware Training (QAT) with dataset sampling methods and weighted loss functions, therefore providing fair deployment of quantized neural networks.
Related papers
- Outlier-Aware Post-Training Quantization for Image Super-Resolution [36.37192976473005]
Quantization techniques, including quantization-aware training (QAT) and post-training quantization (PTQ), have become essential for inference acceleration of image super-resolution (SR) networks.<n>We propose a dual-region quantization strategy that partitions activations into an outlier region and a dense region, applying uniform quantization to each region independently to better balance bit-width allocation.<n>Our method outperforms existing PTQ approaches across various SR networks and datasets, while achieving performance comparable to QAT methods in most scenarios with at least a 75 speedup.
arXiv Detail & Related papers (2025-11-01T19:49:33Z) - Assessing the Potential for Catastrophic Failure in Dynamic Post-Training Quantization [3.437656066916039]
Post-training quantization (PTQ) has emerged as an effective tool for reducing the computational complexity and memory usage of a neural network.<n>There is the potential for drastic performance reduction depending upon the distribution of inputs experienced in inference.
arXiv Detail & Related papers (2025-10-02T18:13:06Z) - Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment.<n>Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z) - Investigating the Impact of Quantization on Adversarial Robustness [22.637585106574722]
Quantization is a technique for reducing the bit-width of deep models to improve their runtime performance and storage efficiency.
In real-world scenarios, quantized models are often faced with adversarial attacks which cause the model to make incorrect inferences.
We conduct a first-time analysis of the impact of the quantization pipeline components that can incorporate robust optimization.
arXiv Detail & Related papers (2024-04-08T16:20:15Z) - Do Emergent Abilities Exist in Quantized Large Language Models: An
Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emphemergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-gained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z) - FIT: A Metric for Model Sensitivity [1.2622086660704197]
We propose FIT, which combines the Fisher information with a model of quantization.
We find that FIT can estimate the final performance of a network without retraining.
FIT is fast to compute when compared to existing methods, demonstrating favourable convergence properties.
arXiv Detail & Related papers (2022-10-16T10:25:29Z) - BiTAT: Neural Network Binarization with Task-dependent Aggregated
Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z) - How Tempering Fixes Data Augmentation in Bayesian Neural Networks [22.188535244056016]
We show that tempering implicitly reduces the misspecification arising from modeling augmentations as i.i.d. data.
The temperature mimics the role of the effective sample size, reflecting the gain in information provided by the augmentations.
arXiv Detail & Related papers (2022-05-27T11:06:56Z) - Cluster-Promoting Quantization with Bit-Drop for Minimizing Network
Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z) - Where Should We Begin? A Low-Level Exploration of Weight Initialization
Impact on Quantized Behaviour of Deep Neural Networks [93.4221402881609]
We present an in-depth, fine-grained ablation study of the effect of different weights initialization on the final distributions of weights and activations of different CNN architectures.
To our best knowledge, we are the first to perform such a low-level, in-depth quantitative analysis of weights initialization and its effect on quantized behaviour.
arXiv Detail & Related papers (2020-11-30T06:54:28Z) - Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.