Data-Augmented Quantization-Aware Knowledge Distillation
- URL: http://arxiv.org/abs/2509.03850v1
- Date: Thu, 04 Sep 2025 03:24:35 GMT
- Title: Data-Augmented Quantization-Aware Knowledge Distillation
- Authors: Justin Kur, Kaiqi Zhao
- Abstract summary: Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. The relationship between quantization-aware KD and data augmentation (DA) remains unexplored. We propose a novel metric which evaluates DAs according to their capacity to maximize the Contextual Mutual Information.
- Score: 1.8126132932201138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. Existing KD and QAT works focus on improving the accuracy of quantized models from the network output perspective by designing better KD loss functions or optimizing QAT's forward and backward propagation. However, limited attention has been given to understanding the impact of input transformations, such as data augmentation (DA). The relationship between quantization-aware KD and DA remains unexplored. In this paper, we address the question: how to select a good DA in quantization-aware KD, especially for models with low precision? We propose a novel metric which evaluates DAs according to their capacity to maximize the Contextual Mutual Information (the information not directly related to an image's label) while also ensuring the predictions for each class are close to the ground truth labels on average. The proposed method automatically ranks and selects DAs, requires minimal training overhead, and is compatible with any KD or QAT algorithm. Extensive evaluations demonstrate that selecting DA strategies using our metric significantly improves state-of-the-art QAT and KD works across various model architectures and datasets.
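The abstract does not include an implementation, but the metric it describes has two measurable parts: a term rewarding label-unrelated ("contextual") information in the predictions on augmented inputs, and a term keeping the average per-class prediction close to the ground-truth labels. The sketch below is one hypothetical way to score a candidate DA along those two axes, assuming a pretrained full-precision teacher and a labeled data loader; the function names, the entropy-based proxy for the contextual term, and the weight `lam` are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: score a candidate data augmentation with a frozen
# full-precision teacher. The "contextual" term is approximated by the entropy
# of the teacher's prediction restricted to non-target classes; the fidelity
# term penalizes average per-class predictions that drift from the one-hot
# ground truth. These choices are assumptions for illustration only.
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_augmentation(teacher, loader, augment, num_classes, lam=1.0, device="cpu"):
    teacher.eval()
    ctx_info = 0.0
    per_class_sum = torch.zeros(num_classes, num_classes)  # summed predictions per true class
    per_class_cnt = torch.zeros(num_classes)
    n = 0
    for images, labels in loader:
        probs = F.softmax(teacher(augment(images).to(device)), dim=1).cpu()  # (B, C)
        # Entropy over the non-target classes: information not tied to the label.
        p_other = probs.clone()
        p_other[torch.arange(len(labels)), labels] = 0.0
        p_other = p_other / p_other.sum(dim=1, keepdim=True).clamp_min(1e-12)
        ctx_info += -(p_other * p_other.clamp_min(1e-12).log()).sum(dim=1).sum().item()
        # Accumulate the average prediction for each ground-truth class.
        per_class_sum.index_add_(0, labels, probs)
        per_class_cnt.index_add_(0, labels, torch.ones(len(labels)))
        n += len(labels)
    mean_pred = per_class_sum / per_class_cnt.clamp_min(1).unsqueeze(1)  # (C, C)
    fidelity_gap = (mean_pred - torch.eye(num_classes)).abs().mean().item()
    return ctx_info / n - lam * fidelity_gap

# Usage: rank candidate DAs and keep the best one for QAT + KD training.
# best_da = max(candidate_das, key=lambda a: score_augmentation(teacher, loader, a, num_classes=10))
```

In this reading, each candidate augmentation is scored once with the frozen teacher and the highest-scoring DA is handed to the chosen QAT/KD pipeline, which is consistent with the abstract's claim of minimal training overhead.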
Related papers
- Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative, yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z)
- Self-Supervised Quantization-Aware Knowledge Distillation [5.4714555711042]
This paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework.
SQAKD unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works.
A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures.
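As background for the quantizer "forward and backward dynamics" mentioned above, the sketch below shows the generic straight-through-estimator (STE) pattern that most QAT methods build on: quantize in the forward pass, pass gradients through unchanged in the backward pass. It is a common building block shown here only for illustration, not SQAKD's actual code; the bit-width and min-max scaling scheme are assumptions.

```python
# Generic k-bit uniform quantizer with a straight-through estimator (STE):
# the forward pass discretizes the tensor, the backward pass approximates the
# (zero almost everywhere) gradient of rounding with the identity.
import torch

class STEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, bits=2):
        levels = 2 ** bits - 1
        x_min, x_max = x.min(), x.max()
        x_norm = (x - x_min) / (x_max - x_min + 1e-12)   # map to [0, 1]
        x_q = torch.round(x_norm * levels) / levels      # snap to 2^bits levels
        return x_q * (x_max - x_min) + x_min             # map back to the original range

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat the quantizer as the identity when propagating gradients.
        return grad_output, None

# w_q = STEQuantize.apply(weight, 2)  # 2-bit weights in the forward pass
```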
arXiv Detail & Related papers (2024-03-17T06:20:28Z)
- Practical Insights into Knowledge Distillation for Pre-Trained Models [7.248285042377168]
This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models. Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD's application is lacking. Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD.
arXiv Detail & Related papers (2024-02-22T19:07:08Z)
- ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift [7.256448072529497]
Knowledge Distillation (KD) transfers knowledge from large models to small models and has recently achieved remarkable success. However, the reliability of existing KD methods in real-world applications, especially under distribution shift, remains underexplored. We propose a unified and systematic framework, ShiftKD, to benchmark KD against two general distributional shifts.
arXiv Detail & Related papers (2023-12-25T10:43:31Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation [10.899753512019933]
Knowledge Distillation (KD) aims to optimize a lightweight network.
KD mainly involves knowledge extraction and distillation strategies.
This paper provides a comprehensive KD survey, including knowledge categories, distillation schemes and algorithms.
arXiv Detail & Related papers (2023-06-19T03:42:44Z)
- Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders [5.396898627891066]
We provide an in-depth analysis of the mechanism of KD on attention recovery of quantized large Transformers.
We propose two KD methods: attention-map and attention-output losses.
The experimental results on various Transformer encoder models demonstrate that the proposed KD methods achieve state-of-the-art accuracy for QAT with sub-2-bit weight quantization.
arXiv Detail & Related papers (2022-11-20T16:23:23Z)
- Efficient training of lightweight neural networks using Online Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize a k-NN non-parametric density estimation technique to estimate the unknown probability distribution of the data samples in the output feature space.
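The k-NN estimator referenced here is the standard non-parametric form p(x) ≈ k / (N * V_d(r_k)), where r_k is the distance from x to its k-th nearest neighbor and V_d is the volume of a d-dimensional ball. Below is a minimal NumPy sketch of that generic estimator, not the OSAKD implementation; the helper name and smoothing constant are assumptions.

```python
# Generic k-NN non-parametric density estimate in a d-dimensional feature
# space: p(x) ~ k / (N * V_d(r_k)). Illustrative only.
import numpy as np
from math import lgamma, log, pi

def knn_density(query, features, k=10):
    dists = np.linalg.norm(features - query, axis=1)  # distances to all N samples
    r_k = float(np.sort(dists)[k - 1])                # distance to the k-th neighbor
    n, d = features.shape
    # Log-volume of the d-ball of radius r_k: pi^(d/2) / Gamma(d/2 + 1) * r_k^d
    log_vol = (d / 2) * log(pi) - lgamma(d / 2 + 1) + d * log(r_k + 1e-12)
    return float(np.exp(log(k) - log(n) - log_vol))
```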
arXiv Detail & Related papers (2021-08-26T14:01:04Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed virtual attention module (VAM), which can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)
- Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective DA method based on a noise generator, which learns to perturb the word embeddings of the input questions and context without changing their semantics.
We validate the performance of QA models trained with our word embedding perturbation on a single source dataset across five different target domains.
Notably, the model trained with our method outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale behind its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
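For reference, the KD loss discussed in this entry is typically the temperature-scaled teacher-student KL term combined with cross-entropy, evaluated on each DA-generated view of the input. The sketch below is a generic version of that setup, assuming PyTorch models; the function name, temperature T, and weight alpha are illustrative assumptions rather than this paper's code.

```python
# Generic temperature-scaled KD loss evaluated on a DA-generated view of the
# input: the student matches the teacher's softened predictions on the view,
# plus a standard cross-entropy term on the hard labels. Sketch only.
import torch
import torch.nn.functional as F

def kd_loss_on_view(student, teacher, images, labels, augment, T=4.0, alpha=0.9):
    view = augment(images)                       # one augmented view of the batch
    with torch.no_grad():
        teacher_logits = teacher(view)
    student_logits = student(view)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale the soft term by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```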
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.