Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs
- URL: http://arxiv.org/abs/2603.04308v1
- Date: Wed, 04 Mar 2026 17:26:29 GMT
- Title: Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs
- Authors: Pranav Kumar Kaliaperumal
- Abstract summary: Post-training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers. This paper provides a reproducible empirical study and systems-level extension of that phenomenon in BERT-base fine-tuned on QNLI.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers, as originally analyzed by Bondarenko et al. (EMNLP 2021) in work associated with Qualcomm AI Research. This paper provides a reproducible empirical study and systems-level extension of that phenomenon in BERT-base fine-tuned on QNLI. When global W8A8 quantization is applied, validation accuracy drops sharply from 89.66% (FP32) to 54.33%, a decrease of 35.33 points. Statistical analysis of FP32 activations shows strongly heavy-tailed behavior that intensifies with model depth: kurtosis reaches 271 in the final layers and approximately 55% of activation energy is concentrated in the top 1% of channels. We evaluate several mitigation strategies. Mixed-precision PTQ restores accuracy close to the FP32 baseline (89.42%). Per-embedding-group (PEG) quantization shows strong sensitivity to grouping structure, improving accuracy from 66.12% with three groups to 86.18% with four groups. In contrast, percentile-based calibration, even at thresholds between 99.0 and 99.99, fails to recover accuracy (about 50.54%), indicating that large activation channels encode structured signal rather than rare noise. Deployment profiling on an RTX 3050 GPU shows minimal differences in latency and memory usage across methods (median latency about 58-59 ms; VRAM usage about 484-486 MB), highlighting the importance of hardware-aware evaluation. Overall, the results show that PTQ failure in transformers is primarily driven by structured channel dominance amplified through residual connections. Effective mitigation therefore requires channel-aware precision allocation rather than scalar clipping alone.
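The abstract's diagnostics can be illustrated with a minimal, hypothetical NumPy sketch. This is not the paper's code: the activation tensor is synthetic (a Gaussian with a few inflated channels standing in for structured outliers), and the helper names are invented for illustration. It computes the two statistics the paper reports (excess kurtosis and the top-1% channel energy share) and applies a naive symmetric per-tensor INT8 quantizer, with optional percentile clipping, to show the kind of scalar calibration the paper finds insufficient.

```python
# Hypothetical sketch, not the paper's implementation: synthetic activations
# with a handful of dominant channels, plus the two outlier statistics and a
# naive per-tensor INT8 quantizer with optional percentile clipping.
import numpy as np

rng = np.random.default_rng(0)

# Simulated FP32 activations: (tokens, channels). Inflating ~1% of channels
# mimics the structured channel dominance described in the abstract.
acts = rng.normal(0.0, 1.0, size=(512, 768)).astype(np.float32)
acts[:, :8] *= 40.0  # 8 of 768 channels carry most of the energy

def excess_kurtosis(x):
    """Excess kurtosis of the flattened tensor (0 for a Gaussian)."""
    z = (x.ravel() - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

def top_channel_energy_share(x, frac=0.01):
    """Fraction of total squared activation energy in the top `frac` channels."""
    energy = (x ** 2).sum(axis=0)            # per-channel energy
    k = max(1, int(frac * x.shape[1]))
    return float(np.sort(energy)[-k:].sum() / energy.sum())

def quantize_int8(x, clip=None):
    """Symmetric per-tensor INT8 fake-quantization.

    `clip`, if given, caps the range at that percentile of |x| -- the scalar
    percentile calibration the paper shows cannot recover accuracy.
    """
    amax = np.abs(x).max() if clip is None else np.percentile(np.abs(x), clip)
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return (q * scale).astype(np.float32)

k = excess_kurtosis(acts)
share = top_channel_energy_share(acts)
err_full = float(np.abs(acts - quantize_int8(acts)).mean())
err_clip = float(np.abs(acts - quantize_int8(acts, clip=99.9)).mean())
print(f"excess kurtosis: {k:.1f}, top-1% channel energy share: {share:.2%}")
print(f"mean |quant error|: full-range {err_full:.4f}, 99.9th-pct clip {err_clip:.4f}")
```

On this synthetic tensor the kurtosis and energy-share values land in the same heavy-tailed regime the paper describes, and clipping trades rounding error in the bulk for large errors on exactly the dominant channels, which is why channel-aware precision allocation, rather than a single clipping threshold, is the remedy the abstract argues for.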
Related papers
- Distributional Reinforcement Learning with Information Bottleneck for Uncertainty-Aware DRAM Equalization [8.695939803795499]
We propose a distributional risk-sensitive reinforcement learning framework integrating Information Bottleneck latent representations with Conditional Value-at-Risk optimization. We introduce rate-distortion optimal signal compression achieving a 51x speedup over eye diagrams. We show that the proposed framework provides a practical solution for production-scale equalizer optimization with certified worst-case guarantees.
arXiv Detail & Related papers (2026-03-05T03:34:25Z) - Dissecting Outlier Dynamics in LLM NVFP4 Pretraining [46.10969678564592]
This study conducts a longitudinal analysis of outlier dynamics across architectures during NVFP4 pretraining. We find that, compared with Softmax Attention (SA), Linear Attention (LA) reduces per-tensor heavy tails but still exhibits persistent block-level spikes under block quantization. We then develop CHON, an NVFP4 training recipe integrating post-QK operation protection.
arXiv Detail & Related papers (2026-02-02T12:50:27Z) - Understanding vision transformer robustness through the lens of out-of-distribution detection [59.72757235382676]
Quantization reduces memory and inference costs at the risk of performance loss. We investigate the behaviour of quantized small variants of popular vision transformers (DeiT, DeiT3, and ViT) on common out-of-distribution (OOD) datasets.
arXiv Detail & Related papers (2026-02-01T22:00:59Z) - Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts [6.221156050218661]
We present a curiosity-driven quantized Mixture-of-Experts framework for deep neural networks on resource-constrained devices. Our 4-bit quantization maintains 99.9 percent of 16-bit accuracy (0.858 vs 0.859 F1) with 4x compression and 41 percent energy savings. Our information-theoretic routing demonstrates that adaptive quantization yields accurate (0.858 F1, 1.2M params), energy-efficient (3.87 F1/mJ), and predictable edge models.
arXiv Detail & Related papers (2025-11-13T15:32:41Z) - Feature Selection and Regularization in Multi-Class Classification: An Empirical Study of One-vs-Rest Logistic Regression with Gradient Descent Optimization and L1 Sparsity Constraints [0.0]
Multi-class wine classification presents fundamental trade-offs between model accuracy, feature dimensionality, and interpretability. This paper presents a comprehensive empirical study of One-vs-Rest logistic regression on the UCI Wine dataset.
arXiv Detail & Related papers (2025-10-16T08:47:05Z) - To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration [46.63567524455431]
Low-precision floating-point formats provide stability, memory savings, and hardware efficiency without dequantization overhead. We propose Exponent-Concentrated FP8 (ECF8), a compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration.
arXiv Detail & Related papers (2025-10-03T02:22:13Z) - APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers [71.2294205496784]
We propose APHQ-ViT, a novel PTQ approach based on importance estimation with the Average Perturbation Hessian (APH). We show that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins at 3-bit and 4-bit precision across different vision tasks.
arXiv Detail & Related papers (2025-04-03T11:48:56Z) - Intelligent Fault Diagnosis of Type and Severity in Low-Frequency, Low Bit-Depth Signals [0.6144680854063939]
The research leverages sound data from the imbalanced MaFaulDa dataset, aiming to strike a balance between high performance and low resource consumption.
We achieved an accuracy of 99.54% and an F-Beta score of 99.52% with just 6 boosting trees at an 8 kHz, 8-bit configuration.
arXiv Detail & Related papers (2024-11-09T22:01:11Z) - "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. We conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - Accurate and Reliable Predictions with Mutual-Transport Ensemble [46.368395985214875]
We propose a co-trained auxiliary model that adaptively regularizes the cross-entropy loss using Kullback-Leibler (KL) divergence.
We show that MTE can simultaneously enhance both accuracy and uncertainty calibration.
For example, on the CIFAR-100 dataset, our MTE method on ResNet34/50 achieved significant improvements compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2024-05-30T03:15:59Z) - QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning [16.50084447690437]
The study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tuning method, QuantTune.
Our approach showcases significant improvements in post-training quantization across a range of Transformer-based models, including ViT, Bert-base, and OPT.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.