Teacher Intervention: Improving Convergence of Quantization Aware
Training for Ultra-Low Precision Transformers
- URL: http://arxiv.org/abs/2302.11812v1
- Date: Thu, 23 Feb 2023 06:48:24 GMT
- Title: Teacher Intervention: Improving Convergence of Quantization Aware
Training for Ultra-Low Precision Transformers
- Authors: Minsoo Kim, Kyuhong Shim, Seongmin Park, Wonyong Sung, Jungwook Choi
- Abstract summary: Quantization-aware training (QAT) is a promising method to lower the implementation cost and energy consumption.
This work proposes a proactive knowledge distillation method called Teacher Intervention (TI) for fast converging QAT of ultra-low precision pre-trained Transformers.
- Score: 17.445202457319517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Transformer models such as BERT have shown great success in a
wide range of applications, but at the cost of substantial increases in model
complexity. Quantization-aware training (QAT) is a promising method to lower
the implementation cost and energy consumption. However, aggressive
quantization below 2-bit causes considerable accuracy degradation due to
unstable convergence, especially when the downstream dataset is not abundant.
This work proposes a proactive knowledge distillation method called Teacher
Intervention (TI) for fast converging QAT of ultra-low precision pre-trained
Transformers. TI intervenes in layer-wise signal propagation with the intact
signal from the teacher to remove the interference of propagated quantization
errors, smoothing the loss surface of QAT and expediting convergence.
Furthermore, we propose a gradual intervention mechanism to stabilize the
recovery of subsections of Transformer layers from quantization. The proposed
schemes enable fast convergence of QAT and improve the model accuracy
regardless of the diverse characteristics of downstream fine-tuning tasks. We
demonstrate that TI consistently achieves superior accuracy with significantly
fewer fine-tuning iterations on well-known Transformers for natural language
processing as well as computer vision, compared to state-of-the-art QAT
methods.
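For context, the intervention described above can be pictured as a layer-wise distillation loop in which each quantized student layer consumes the teacher's intact hidden state instead of the student's own error-corrupted output. The following is a minimal sketch, assuming a PyTorch-style student/teacher pair whose layers are callables taking (hidden_states, attention_mask); the function and variable names are illustrative, not the authors' implementation.

```python
# Minimal sketch of layer-wise teacher intervention during QAT.
# Assumption: student_layers are fake-quantized Transformer layers and
# teacher_layers are their full-precision counterparts, both callable as
# layer(hidden_states, attention_mask). Names are illustrative only.

import torch
import torch.nn.functional as F


def teacher_intervention_loss(student_layers, teacher_layers, embeddings,
                              attention_mask=None):
    loss = embeddings.new_zeros(())
    teacher_hidden = embeddings
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        with torch.no_grad():
            teacher_out = t_layer(teacher_hidden, attention_mask)  # intact teacher signal
        # Intervention: the quantized layer sees the teacher's input, so
        # quantization error from earlier layers does not propagate into it.
        student_out = s_layer(teacher_hidden, attention_mask)
        loss = loss + F.mse_loss(student_out, teacher_out)  # layer-wise feature matching
        teacher_hidden = teacher_out
    return loss
```

The gradual intervention mechanism in the abstract can then be read as handing subsections of layers back to ordinary end-to-end student propagation as they recover from quantization.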
Related papers
- StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths [49.94623294999562]
Quantization-aware training (QAT) is essential for deploying large models under strict memory and latency constraints.
Common approaches based on the straight-through estimator (STE) or soft quantizers often suffer from mismatch, instability, or high computational overhead.
We propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low-bit settings.
arXiv Detail & Related papers (2026-01-27T08:00:57Z) - Continual Quantum Architecture Search with Tensor-Train Encoding: Theory and Applications to Signal Processing [68.35481158940401]
CL-QAS is a continual quantum architecture search framework.
It mitigates the challenges of costly amplitude encoding and forgetting in variational quantum circuits.
It achieves controllable robustness and expressivity, sample-efficient generalization, and smooth convergence without barren plateaus.
arXiv Detail & Related papers (2026-01-10T02:36:03Z) - CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training [73.46600457802693]
We introduce a new method that counteracts the loss induced by quantization.
CAGE significantly improves upon state-of-the-art methods in terms of accuracy, for similar computational cost.
For QAT pre-training of Llama models, CAGE matches the accuracy achieved at 4 bits (W4A4) with the prior best method.
arXiv Detail & Related papers (2025-10-21T16:33:57Z) - Progressive Element-wise Gradient Estimation for Neural Network Quantization [2.1413624861650358]
Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions (a minimal STE sketch appears at the end of this list).
We propose Progressive Element-wise Gradient Estimation (PEGE) to address discretization errors between continuous and quantized values.
PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
arXiv Detail & Related papers (2025-08-27T15:59:36Z) - LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Image and Video Generation [41.66473889057111]
Diffusion Transformers (DiTs) have achieved impressive performance in text-to-image and text-to-video generation.
DiTs' high computational cost and large parameter sizes pose significant challenges for usage in resource-constrained scenarios.
We propose LRQ-DiT, an efficient and accurate post-training quantization framework for image and video generation.
arXiv Detail & Related papers (2025-08-05T14:16:11Z) - PQCAD-DM: Progressive Quantization and Calibration-Assisted Distillation for Extremely Efficient Diffusion Model [8.195126516665914]
Diffusion models excel in image generation but are computationally and resource-intensive.
We propose PQCAD-DM, a novel hybrid compression framework combining Progressive Quantization (PQ) and Calibration-Assisted Distillation (CAD).
PQ employs a two-stage quantization with adaptive bit-width transitions guided by a momentum-based mechanism, reducing excessive weight perturbations at low precision.
arXiv Detail & Related papers (2025-06-20T06:43:27Z) - Lightweight Task-Oriented Semantic Communication Empowered by Large-Scale AI Models [66.57755931421285]
Large-scale artificial intelligence (LAI) models pose significant challenges for real-time communication scenarios.
This paper proposes utilizing knowledge distillation (KD) techniques to extract and condense knowledge from LAI models.
We propose a fast distillation method featuring a pre-stored compression mechanism that eliminates the need for repetitive inference.
arXiv Detail & Related papers (2025-06-16T08:42:16Z) - FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation [55.12070409045766]
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years.
Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z) - Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization [25.441086332799348]
Low-Rank Adaptation (LoRA) has become a standard approach for parameter-efficient fine-tuning.
We extend the sinusoidal transformation framework to quantized LoRA adapters.
arXiv Detail & Related papers (2025-05-28T02:15:15Z) - Stabilizing Quantization-Aware Training by Implicit-Regularization on Hessian Matrix [0.7261171488281837]
We find that a sharp loss landscape, which leads to a dramatic performance drop, is an essential factor causing instability.
We propose Feature-Perturbed Quantization (FPQ), which generalizes and applies feature distillation to the quantized model.
arXiv Detail & Related papers (2025-03-14T07:56:20Z) - Oscillations Make Neural Networks Robust to Quantization [0.16385815610837165]
We challenge the view that oscillations in Quantization Aware Training (QAT) are undesirable artifacts caused by the Straight-Through Estimator (STE).
We propose a novel regularization method that induces oscillations to improve quantization.
arXiv Detail & Related papers (2025-02-01T16:39:58Z) - GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.
Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.
Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints [51.83081671798784]
Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability.
DiT's practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference.
We propose Skip-DiT, a novel DiT variant enhanced with Long-Skip-Connections (LSCs) - the key efficiency component in U-Nets.
arXiv Detail & Related papers (2024-11-26T17:28:10Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing [5.174900115018253]
We propose a data-free post-training quantization (PTQ) method for efficient Diffusion Transformers (DiTs).
DiTAS relies on the proposed temporal-aggregated smoothing techniques to mitigate the impact of the channel-wise outliers within the input activations.
We show that our approach enables 4-bit weight, 8-bit activation (W4A8) quantization for DiTs while maintaining comparable performance as the full-precision model.
arXiv Detail & Related papers (2024-09-12T05:18:57Z) - Adaptive variational quantum dynamics simulations with compressed circuits and fewer measurements [4.2643127089535104]
We present an improved version of the adaptive variational quantum dynamics simulation (AVQDS) method, which we call AVQDS(T).
The algorithm adaptively adds layers of disjoint unitary gates to the ansatz circuit so as to keep the McLachlan distance, a measure of the accuracy of the variational dynamics, below a fixed threshold.
We also show a method based on eigenvalue truncation to solve the linear equations of motion for the variational parameters with enhanced noise resilience.
arXiv Detail & Related papers (2024-08-13T02:56:43Z) - DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers [2.0862654518798034]
We propose a Distribution-Friendly and Outlier-Aware Post-training Quantization method for Vision Transformers.
DopQ-ViT analyzes the inefficiencies of current quantizers and introduces a distribution-friendly Tan Quantizer called TanQ.
DopQ-ViT has been extensively validated and significantly improves the performance of quantization models.
arXiv Detail & Related papers (2024-08-06T16:40:04Z) - Towards Accurate Post-Training Quantization for Vision Transformer [48.779346466374406]
Existing post-training quantization methods still cause severe performance drops.
APQ-ViT surpasses the existing post-training quantization methods by convincing margins.
arXiv Detail & Related papers (2023-03-25T03:05:26Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Neural Networks with Quantization Constraints [111.42313650830248]
We present a constrained learning approach to quantization training.
We show that the resulting problem is strongly dual and does away with gradient estimations.
We demonstrate that the proposed approach exhibits competitive performance in image classification tasks.
arXiv Detail & Related papers (2022-10-27T17:12:48Z) - BiT: Robustly Binarized Multi-distilled Transformer [36.06192421902272]
We develop binarized transformer models that, for the first time, reach a practical level of accuracy, approaching a full-precision BERT baseline within as little as 5.9%.
arXiv Detail & Related papers (2022-05-25T19:01:54Z) - Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on Penn Treebank (PTB) and a Switchboard corpus trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z) - TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
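Several of the entries above (StableQAT, PEGE, and the oscillation study) revolve around the straight-through estimator (STE) that quantization-aware training uses to backpropagate through the rounding step. For reference, below is a minimal symmetric fake-quantization sketch with an STE backward pass; the fixed scale and bit-width handling are simplifying assumptions rather than the method of any listed paper.

```python
# Minimal symmetric fake-quantization with a straight-through estimator (STE).
# Reference sketch only; the QAT methods above add learned scales, clipping
# strategies, per-channel handling, and other refinements.

import torch


class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, num_bits=2):
        qmin = -(2 ** (num_bits - 1))
        qmax = 2 ** (num_bits - 1) - 1
        q = torch.clamp(torch.round(x / scale), qmin, qmax)
        return q * scale  # dequantized value used in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat the non-differentiable round/clamp as identity and pass
        # the gradient straight through to x; no gradient for scale or num_bits.
        return grad_output, None, None


# Usage inside a layer's forward pass, e.g. for 2-bit weights:
# w_q = FakeQuantSTE.apply(weight, scale, 2)
```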