Robust and Resource-Efficient Data-Free Knowledge Distillation by Generative Pseudo Replay
- URL: http://arxiv.org/abs/2201.03019v3
- Date: Mon, 29 Jul 2024 13:57:56 GMT
- Title: Robust and Resource-Efficient Data-Free Knowledge Distillation by Generative Pseudo Replay
- Authors: Kuluhan Binici, Shivam Aggarwal, Nam Trung Pham, Karianto Leman, Tulika Mitra
- Abstract summary: Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data.
Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process.
However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy.
A practical data-free KD method should therefore be robust and ideally provide monotonically increasing student accuracy during distillation, which is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data.
We propose to model the distribution of the previously observed synthetic samples with a generative network.
- Score: 5.3330804968579795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data. Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process. However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy. Therefore, a practical data-free KD method should be robust and ideally provide monotonically increasing student accuracy during distillation. This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data. A straightforward approach to overcome this issue is to store and rehearse the generated samples periodically, which increases the memory footprint and creates privacy concerns. We propose to model the distribution of the previously observed synthetic samples with a generative network. In particular, we design a Variational Autoencoder (VAE) with a training objective that is customized to learn the synthetic data representations optimally. The student is rehearsed by the generative pseudo replay technique, with samples produced by the VAE. Hence knowledge degradation can be prevented without storing any samples. Experiments on image classification benchmarks show that our method optimizes the expected value of the distilled model accuracy while eliminating the large memory overhead incurred by the sample-storing methods.
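Conceptually, each distillation step interleaves three things: synthesizing fresh pseudo data, distilling the teacher into the student on both fresh and replayed samples, and updating a VAE that models the distribution of previously seen synthetic batches. The following is a minimal PyTorch-style sketch of that loop under stated assumptions: `generator`, `vae`, and the `vae.sample()` / `(recon, mu, logvar)` interfaces are hypothetical placeholders for illustration, not the authors' released implementation.
```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard temperature-scaled KL distillation loss."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def pseudo_replay_step(teacher, student, generator, vae,
                       opt_student, opt_vae,
                       z_dim=100, batch=128, replay_ratio=0.5, device="cpu"):
    """One distillation step with generative pseudo replay (illustrative sketch).

    `generator` is assumed to be updated elsewhere (e.g. adversarially against
    the student); `vae` is assumed to expose `sample(n)` and to return
    (reconstruction, mu, logvar) when called on a batch. Both interfaces are
    hypothetical placeholders.
    """
    teacher.eval()

    # 1) Synthesize a fresh batch of pseudo training data.
    z = torch.randn(batch, z_dim, device=device)
    with torch.no_grad():
        x_new = generator(z)
        # 2) Rehearse with samples decoded by the VAE, which models the
        #    distribution of previously observed synthetic batches.
        x_replay = vae.sample(int(batch * replay_ratio))
        x = torch.cat([x_new, x_replay], dim=0)
        t_logits = teacher(x)

    # 3) Distill on current + replayed samples so the student does not
    #    forget regions the generator has drifted away from.
    opt_student.zero_grad()
    loss_kd = kd_loss(student(x), t_logits)
    loss_kd.backward()
    opt_student.step()

    # 4) Keep the VAE tracking the (shifting) synthetic data distribution.
    opt_vae.zero_grad()
    recon, mu, logvar = vae(x_new)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss_vae = F.mse_loss(recon, x_new) + kl
    loss_vae.backward()
    opt_vae.step()

    return loss_kd.item(), loss_vae.item()
```
Because the replayed images are decoded from the VAE rather than read from a stored buffer, the memory footprint stays constant over distillation steps, which is the property the abstract contrasts with sample-storing rehearsal.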
Related papers
- Small Scale Data-Free Knowledge Distillation [37.708282211941416]
We propose Small Scale Data-free Knowledge Distillation (SSD-KD).
SSD-KD balances synthetic samples and uses a priority sampling function to select proper samples.
It can perform distillation training conditioned on an extremely small scale of synthetic samples.
arXiv Detail & Related papers (2024-06-12T05:09:41Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Prompting to Distill: Boosting Data-Free Knowledge Distillation via Reinforced Prompt [52.6946016535059]
Data-free knowledge distillation (DFKD) conducts knowledge distillation by eliminating the dependence on original training data.
We propose a prompt-based method, termed PromptDFD, that allows us to take advantage of learned language priors.
As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance.
arXiv Detail & Related papers (2022-05-16T08:56:53Z)
- Conditional Generative Data-Free Knowledge Distillation based on Attention Transfer [0.8594140167290099]
We propose a conditional generative data-free knowledge distillation (CGDD) framework to train an efficient portable network without any real data.
In this framework, besides the knowledge extracted from the teacher model, we introduce preset labels as additional auxiliary information.
We show that the portable network trained with the proposed data-free distillation method obtains 99.63%, 99.07%, and 99.84% relative accuracy on CIFAR10, CIFAR100, and Caltech101, respectively.
arXiv Detail & Related papers (2021-12-31T09:23:40Z)
- Efficient training of lightweight neural networks using Online Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize a k-NN non-parametric density estimation technique to estimate the unknown probability distributions of the data samples in the output feature space (see the illustrative k-NN density sketch after this list).
arXiv Detail & Related papers (2021-08-26T14:01:04Z)
- Preventing Catastrophic Forgetting and Distribution Mismatch in Knowledge Distillation via Synthetic Data [5.064036314529226]
We propose a data-free KD framework that maintains a dynamic collection of generated samples over time.
Our experiments demonstrate that we can improve the accuracy of the student models obtained via KD when compared with state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-11T08:11:08Z)
- Dual Discriminator Adversarial Distillation for Data-free Model Compression [36.49964835173507]
We propose Dual Discriminator Adversarial Distillation (DDAD) to distill a neural network without any training data or meta-data.
Specifically, a generator trained through dual discriminator adversarial distillation creates samples that mimic the original training data.
The proposed method obtains an efficient student network which closely approximates its teacher network, despite using no original training data.
arXiv Detail & Related papers (2021-04-12T12:01:45Z)
- Enhancing Data-Free Adversarial Distillation with Activation Regularization and Virtual Interpolation [19.778192371420793]
A data-free adversarial distillation framework deploys a generative network to transfer the teacher model's knowledge to the student model.
We add an activation regularizer and a virtual adversarial method to improve the data generation efficiency.
Our model's accuracy is 13.8% higher than the state-of-the-art data-free method on CIFAR-100.
arXiv Detail & Related papers (2021-02-23T11:37:40Z)
- Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
We further propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
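For the OSAKD entry above, the k-NN density idea can be illustrated with a short self-contained sketch. The code below implements the textbook k-nearest-neighbour density estimate, p(x) ≈ k / (N · V_d · r_k(x)^d); the function and array names are illustrative assumptions, not that paper's code.
```python
import numpy as np
from math import gamma, pi

def knn_density(features, query, k=5):
    """Non-parametric k-NN density estimate in a feature space.

    features: (N, d) array of reference feature vectors.
    query:    (M, d) array of points at which to estimate the density.
    Returns an (M,) array with p(x) ~ k / (N * V_d * r_k(x)^d), where
    r_k(x) is the distance to the k-th nearest neighbour of x and
    V_d is the volume of the unit ball in d dimensions.
    """
    n, d = features.shape
    unit_ball_volume = pi ** (d / 2) / gamma(d / 2 + 1)
    # Pairwise Euclidean distances between queries and stored features.
    dists = np.linalg.norm(query[:, None, :] - features[None, :, :], axis=-1)
    r_k = np.sort(dists, axis=1)[:, k - 1]  # distance to k-th neighbour
    return k / (n * unit_ball_volume * np.maximum(r_k, 1e-12) ** d)

# Example: densities of a few query features relative to a reference set
# (e.g. penultimate-layer outputs collected during training).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(1000, 8))
    q = rng.normal(size=(4, 8))
    print(knn_density(ref, q, k=10))
```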