Model compression using knowledge distillation with integrated gradients
- URL: http://arxiv.org/abs/2506.14440v1
- Date: Tue, 17 Jun 2025 12:00:23 GMT
- Title: Model compression using knowledge distillation with integrated gradients
- Authors: David E. Hernandez, Jose Chang, Torbjörn E. M. Nordling
- Abstract summary: We introduce a novel method enhancing knowledge distillation with integrated gradients (IG). Our approach overlays IG maps onto input images during training, providing student models with deeper insights into teacher models' decision-making processes. Our method precomputes IG maps before training, transforming substantial runtime costs into a one-time preprocessing step.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model compression is critical for deploying deep learning models on resource-constrained devices. We introduce a novel method enhancing knowledge distillation with integrated gradients (IG) as a data augmentation strategy. Our approach overlays IG maps onto input images during training, providing student models with deeper insights into teacher models' decision-making processes. Extensive evaluation on CIFAR-10 demonstrates that our IG-augmented knowledge distillation achieves 92.6% testing accuracy with a 4.1x compression factor, a significant 1.1 percentage point improvement ($p<0.001$) over non-distilled models (91.5%). This compression reduces inference time from 140 ms to 13 ms. Our method precomputes IG maps before training, transforming substantial runtime costs into a one-time preprocessing step. Our comprehensive experiments include: (1) comparisons with attention transfer, revealing complementary benefits when combined with our approach; (2) Monte Carlo simulations confirming statistical robustness; (3) systematic evaluation of compression factor versus accuracy trade-offs across a wide range (2.2x-1122x); and (4) validation on an ImageNet subset aligned with CIFAR-10 classes, demonstrating generalisability beyond the initial dataset. These extensive ablation studies confirm that IG-based knowledge distillation consistently outperforms conventional approaches across varied architectures and compression ratios. Our results establish this framework as a viable compression technique for real-world deployment on edge devices while maintaining competitive accuracy.
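To make the described pipeline concrete, below is a minimal PyTorch sketch of the two ingredients named in the abstract: precomputing an integrated-gradients map with the trained teacher, and overlaying that map onto the input image as an augmentation. The function names, the black baseline, the blending scheme, and the hyperparameters are illustrative assumptions, not the authors' released code.

```python
import torch

def integrated_gradients(teacher, image, target, steps=50):
    """Riemann-sum approximation of IG along the straight path from a black
    baseline to the image (assumed baseline; the paper may use another)."""
    teacher.eval()
    baseline = torch.zeros_like(image)
    accumulated = torch.zeros_like(image)
    for alpha in torch.linspace(0.0, 1.0, steps).tolist():
        point = (baseline + alpha * (image - baseline)).requires_grad_(True)
        score = teacher(point.unsqueeze(0))[0, target]   # teacher logit for the target class
        grad, = torch.autograd.grad(score, point)
        accumulated += grad
    return (image - baseline) * accumulated / steps      # attribution map, same shape as image

def overlay_ig(image, ig_map, strength=0.3):
    """Blend a normalised saliency map into the image (one plausible overlay)."""
    saliency = ig_map.abs().sum(dim=0, keepdim=True)                           # collapse channels
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return (1.0 - strength) * image + strength * saliency * image
```

Because the maps depend only on the frozen teacher, they can be computed once for the whole training set and cached to disk, which is how the runtime cost becomes a one-time preprocessing step; during training the cached map is loaded, overlaid with `overlay_ig`, and the student is optimised with a conventional distillation objective (a generic version is sketched after the related-papers list).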
Related papers
- Knowledge Distillation: Enhancing Neural Network Compression with Integrated Gradients [0.0]
This paper proposes a machine learning framework that augments Knowledge Distillation (KD) with Integrated Gradients (IG). We introduce a novel data augmentation strategy where IG maps, precomputed from a teacher model, are overlaid onto training images to guide a compact student model toward critical feature representations. Experiments on CIFAR-10 demonstrate the efficacy of our method: a student model, compressed 4.1-fold from the MobileNet-V2 teacher, achieves 92.5% classification accuracy, surpassing the baseline student's 91.4% and traditional KD approaches, while reducing inference latency from 140 ms to 13 ms, a tenfold speedup.
arXiv Detail & Related papers (2025-03-17T10:07:50Z)
- Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal [56.307484956135355]
CODiff is a compression-aware one-step diffusion model for JPEG artifact removal. We propose a dual learning strategy that combines explicit and implicit learning. Results demonstrate that CODiff surpasses recent leading methods in both quantitative and visual quality metrics.
arXiv Detail & Related papers (2025-02-14T02:46:27Z)
- CALLIC: Content Adaptive Learning for Lossless Image Compression [64.47244912937204]
CALLIC sets a new state-of-the-art (SOTA) for learned lossless image compression. We propose a content-aware autoregressive self-attention mechanism by leveraging convolutional gating operations. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights on the testing image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with gradually increasing patches that are sorted in descending order by estimated entropy, optimizing the learning process and reducing adaptation time.
arXiv Detail & Related papers (2024-12-23T10:41:18Z)
- Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- Approximating Human-Like Few-shot Learning with GPT-based Compression [55.699707962017975]
We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference.
We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity.
arXiv Detail & Related papers (2023-08-14T05:22:33Z)
- Enabling Deep Learning on Edge Devices through Filter Pruning and Knowledge Transfer [5.239675888749389]
First, the paper proposes a novel filter-pruning-based model compression method to create lightweight trainable models from large models trained in the cloud.
Second, it proposes a novel knowledge transfer method to enable the on-device model to update incrementally in real time or near real time.
The results show that 1) our model compression method can remove up to 99.36% of the parameters of WRN-28-10, while preserving a Top-1 accuracy of over 90% on CIFAR-10.
arXiv Detail & Related papers (2022-01-22T00:27:21Z)
- New Perspective on Progressive GANs Distillation for One-class Novelty Detection [21.90786581579228]
A Generative Adversarial Network based on the Encoder-Decoder-Encoder scheme (EDE-GAN) achieves state-of-the-art performance.
A new technique, Progressive Knowledge Distillation with GANs (P-KDGAN), connects two standard GANs through a designed distillation loss.
Two-step progressive learning continuously augments the performance of student GANs, with improved results over the single-step approach.
arXiv Detail & Related papers (2021-09-15T13:45:30Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- Online Ensemble Model Compression using Knowledge Distillation [51.59021417947258]
This paper presents a knowledge distillation based model compression framework consisting of a student ensemble.
It enables distillation of simultaneously learnt ensemble knowledge onto each of the compressed student models.
We provide comprehensive experiments using state-of-the-art classification models to validate our framework's effectiveness.
arXiv Detail & Related papers (2020-11-15T04:46:29Z)
- Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution [17.996541285382463]
We propose extracurricular learning to bridge the gap between a compressed student model and its teacher.
We conduct rigorous evaluations on regression and classification tasks and show that compared to the standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.
This leads to major accuracy improvements compared to the empirical risk minimization-based training for various recent neural network architectures.
arXiv Detail & Related papers (2020-06-30T18:21:21Z)
- Learning End-to-End Lossy Image Compression: A Benchmark [90.35363142246806]
We first conduct a comprehensive literature survey of learned image compression methods.
We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and provide insights into their historical development routes.
By introducing a coarse-to-fine hyperprior model for entropy estimation and signal reconstruction, we achieve improved rate-distortion performance.
arXiv Detail & Related papers (2020-02-10T13:13:43Z)
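For context, the distillation objective that the paper above, and several of the distillation-based works in this list, builds on is the standard temperature-scaled formulation of Hinton et al. A minimal sketch follows; the temperature and weighting are chosen purely for illustration and are not taken from any of the listed papers.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Alpha-weighted mix of soft-target KL divergence and hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # T^2 restores the gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

In the IG-augmented setting sketched earlier, the student logits would be computed on the overlaid images rather than on the raw inputs.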
This list is automatically generated from the titles and abstracts of the papers on this site.