Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer
- URL: http://arxiv.org/abs/2509.20854v1
- Date: Thu, 25 Sep 2025 07:43:13 GMT
- Title: Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer
- Authors: Abdur Rehman, S M A Sharif, Md Abdur Rahaman, Mohamed Jismy Aashik Rasool, Seongwan Kim, Jaeho Lee
- Abstract summary: Game of Regularizer (GoR) is a learnable regularization method that adaptively balances task-specific (TS) and distillation losses. GoR consistently outperforms state-of-the-art QAT-KD methods on low-power edge devices. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models.
- Score: 9.85847764731154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR can outperform full-precision models, providing a robust solution for real-world deployment.
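The abstract describes GoR as weighting the task-specific and distillation losses with only two trainable parameters. The exact formulation is not given in the abstract, but the idea can be sketched with an uncertainty-style weighting scheme, in which each scalar sets a loss weight and an additive term discourages the weight from collapsing to zero. All names below are illustrative, not from the paper:

```python
import math

def balanced_loss(loss_ts, loss_kd, log_var_ts, log_var_kd):
    """Hypothetical two-parameter dynamic loss weighting in the spirit of GoR.

    log_var_ts and log_var_kd stand in for the two trainable scalars; in
    actual QAT-KD training they would be optimized jointly with the model
    weights. This is a sketch of one plausible scheme, not the paper's
    formulation.
    """
    w_ts = math.exp(-log_var_ts)  # weight on the task-specific loss
    w_kd = math.exp(-log_var_kd)  # weight on the distillation loss
    # The additive log-variance terms penalize driving both weights to zero.
    return w_ts * loss_ts + log_var_ts + w_kd * loss_kd + log_var_kd

# Example: with both log-variances at 0, each weight is exp(0) = 1,
# so the two losses are summed unweighted.
total = balanced_loss(2.0, 3.0, 0.0, 0.0)
print(total)  # 5.0
```

Because the weighting parameters receive gradients from the combined loss, the balance between the two supervision signals can shift during training rather than being fixed by a hand-tuned hyperparameter.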
Related papers
- Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield [54.328202401611264]
Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. We show that the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains.
arXiv Detail & Related papers (2025-11-27T18:24:28Z) - Extreme Model Compression with Structured Sparsity at Low Precision [10.976782748075067]
Deep neural networks (DNNs) are used in many applications, but their large size and high computational cost make them hard to run on devices with limited resources. Two widely used techniques to address this challenge are weight quantization, which lowers the precision of all weights, and structured sparsity, which removes unimportant weights while retaining the important ones at full precision. We introduce SLOPE, Structured Sparsity at Low Precision, a unified framework that effectively combines structured sparsity and low-bit quantization in a principled way.
arXiv Detail & Related papers (2025-11-11T15:37:55Z) - Progressive Element-wise Gradient Estimation for Neural Network Quantization [2.1413624861650358]
Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions. We propose Progressive Element-wise Gradient Estimation (PEGE) to address discretization errors between continuous and quantized values. PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
arXiv Detail & Related papers (2025-08-27T15:59:36Z) - QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR). We use a calibration dataset to measure both spatial and temporal complexity for each layer. We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - PQCAD-DM: Progressive Quantization and Calibration-Assisted Distillation for Extremely Efficient Diffusion Model [8.195126516665914]
Diffusion models excel in image generation but are computationally and resource-intensive. We propose PQCAD-DM, a novel hybrid compression framework combining Progressive Quantization (PQ) and Calibration-Assisted Distillation (CAD). PQ employs a two-stage quantization with adaptive bit-width transitions guided by a momentum-based mechanism, reducing excessive weight perturbations at low precision.
arXiv Detail & Related papers (2025-06-20T06:43:27Z) - Lightweight Task-Oriented Semantic Communication Empowered by Large-Scale AI Models [66.57755931421285]
Large-scale artificial intelligence (LAI) models pose significant challenges for real-time communication scenarios. This paper proposes utilizing knowledge distillation (KD) techniques to extract and condense knowledge from LAI models. We propose a fast distillation method featuring a pre-stored compression mechanism that eliminates the need for repetitive inference.
arXiv Detail & Related papers (2025-06-16T08:42:16Z) - Self-Supervised Quantization-Aware Knowledge Distillation [5.4714555711042]
This paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework.
SQAKD unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works.
A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures.
arXiv Detail & Related papers (2024-03-17T06:20:28Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of ''student'' models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in a SSL setting.
The co-evolution during pre-training of both dense and gated encoder offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - BD-KD: Balancing the Divergences for Online Knowledge Distillation [11.874952582465601]
We introduce BD-KD (Balanced Divergence Knowledge Distillation), a framework for logit-based online KD. BD-KD enhances both accuracy and model calibration simultaneously, eliminating the need for post-hoc recalibration techniques. Our method encourages student-centered training by adjusting the conventional online distillation loss on both the student and teacher losses.
arXiv Detail & Related papers (2022-12-25T22:27:32Z) - Self-Distillation from the Last Mini-Batch for Consistency Regularization [14.388479145440636]
We propose an efficient and reliable self-distillation framework, named Self-Distillation from Last Mini-Batch (DLB).
Our proposed mechanism guides the training stability and consistency, resulting in robustness to label noise.
Experimental results on three classification benchmarks illustrate that our approach can consistently outperform state-of-the-art self-distillation approaches.
arXiv Detail & Related papers (2022-03-30T09:50:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.