Understanding and Improving Knowledge Distillation for
Quantization-Aware Training of Large Transformer Encoders
- URL: http://arxiv.org/abs/2211.11014v1
- Date: Sun, 20 Nov 2022 16:23:23 GMT
- Title: Understanding and Improving Knowledge Distillation for
Quantization-Aware Training of Large Transformer Encoders
- Authors: Minsoo Kim, Sihwa Lee, Sukjin Hong, Du-Seong Chang, Jungwook Choi
- Abstract summary: We provide an in-depth analysis of the mechanism of KD on attention recovery of quantized large Transformers.
We propose two KD methods; attention-map and attention-output losses.
The experimental results on various Transformer encoder models demonstrate that the proposed KD methods achieve state-of-the-art accuracy for QAT with sub-2-bit weight quantization.
- Score: 5.396898627891066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) has been a ubiquitous method for model
compression to strengthen the capability of a lightweight model with the
transferred knowledge from the teacher. In particular, KD has been employed in
quantization-aware training (QAT) of Transformer encoders like BERT to improve
the accuracy of the student model with the reduced-precision weight parameters.
However, little is understood about which of the various KD approaches best
fits the QAT of Transformers. In this work, we provide an in-depth analysis of
the mechanism of KD on attention recovery of quantized large Transformers. In
particular, we reveal that the previously adopted MSE loss on the attention
score is insufficient for recovering the self-attention information. Therefore,
we propose two KD methods; attention-map and attention-output losses.
Furthermore, we explore the unification of both losses to address
task-dependent preference between attention-map and output losses. The
experimental results on various Transformer encoder models demonstrate that the
proposed KD methods achieve state-of-the-art accuracy for QAT with sub-2-bit
weight quantization.
Related papers
- Compute-Optimal Quantization-Aware Training [50.98555000360485]
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks.<n>Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy.<n>We investigate how different QAT durations impact final performance.
arXiv Detail & Related papers (2025-09-26T21:09:54Z) - Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer [9.85847764731154]
Game of Regularizer (GoR) is a learnable regularization method that adaptively balances task-specific (TS) and distillation losses.<n>GoR consistently outperforms state-of-the-art QAT-KD methods on low-power edge devices.<n>We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models.
arXiv Detail & Related papers (2025-09-25T07:43:13Z) - Data-Augmented Quantization-Aware Knowledge Distillation [1.8126132932201138]
Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models.<n>The relationship between quantization-aware KD and data augmentation (DA) remains unexplored.<n>We propose a novel metric which evaluates DAs according to their capacity to maximize the Contextual Mutual Information.
arXiv Detail & Related papers (2025-09-04T03:24:35Z) - 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
It is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models [5.69541128149828]
Large generative models such as large language models (LLMs) and diffusion models have revolutionized the fields of NLP and computer vision respectively.
In this study, we propose a light-weight quantization aware fine tuning technique using knowledge distillation (KD-QAT) to improve the performance of 4-bit weight quantized LLMs.
We show that ov-freeze results in near floating point precision performance, i.e., less than 0.7% loss of accuracy on Commonsense Reasoning benchmarks.
arXiv Detail & Related papers (2024-03-26T23:51:44Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of ''student'' models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Sequence-Level Knowledge Distillation for Class-Incremental End-to-End
Spoken Language Understanding [10.187334662184314]
We tackle the problem of Spoken Language Understanding applied to a continual learning setting.
We propose three knowledge distillation approaches to mitigate forgetting for a sequence-to-sequence transformer model.
arXiv Detail & Related papers (2023-05-23T10:24:07Z) - Quick Dense Retrievers Consume KALE: Post Training Kullback Leibler
Alignment of Embeddings for Asymmetrical dual encoders [89.29256833403169]
We introduce Kullback Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods.
KALE extends traditional Knowledge Distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation.
Using KALE and asymmetric training, we can generate models which exceed the performance of DistilBERT despite having 3x faster inference.
arXiv Detail & Related papers (2023-03-31T15:44:13Z) - Teacher Intervention: Improving Convergence of Quantization Aware
Training for Ultra-Low Precision Transformers [17.445202457319517]
Quantization-aware training (QAT) is a promising method to lower the implementation cost and energy consumption.
This work proposes a proactive knowledge distillation method called Teacher Intervention (TI) for fast converging QAT of ultra-low precision pre-trained Transformers.
arXiv Detail & Related papers (2023-02-23T06:48:24Z) - Mixed Precision of Quantization of Transformer Language Models for
Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments conducted on Penn Treebank (PTB) and a Switchboard corpus trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z) - Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain an 81.29% top-1 accuracy using DeiT-B model on ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z) - DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution
Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop in ultra-low precision of 4 or lower bit-widths, or require a heavy fine-tuning process to recover the performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z) - Quantum key distribution over quantum repeaters with encoding: Using
Error Detection as an Effective Post-Selection Tool [0.9176056742068812]
We show that it is often more efficient to use the error detection, rather than the error correction, capability of the underlying code to sift out cases where an error has been detected.
We implement our technique for three-qubit repetition codes by modelling different sources of error in crucial components of the system.
arXiv Detail & Related papers (2020-07-13T13:37:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.