Momentum Adversarial Distillation: Handling Large Distribution Shifts in
Data-Free Knowledge Distillation
- URL: http://arxiv.org/abs/2209.10359v1
- Date: Wed, 21 Sep 2022 13:53:56 GMT
- Title: Momentum Adversarial Distillation: Handling Large Distribution Shifts in
Data-Free Knowledge Distillation
- Authors: Kien Do, Hung Le, Dung Nguyen, Dang Nguyen, Haripriya Harikumar,
Truyen Tran, Santu Rana, Svetha Venkatesh
- Abstract summary: We propose a simple yet effective method called Momentum Adversarial Distillation (MAD)
MAD maintains an exponential moving average (EMA) copy of the generator and uses synthetic samples from both the generator and the EMA generator to train the student.
Our experiments on six benchmark datasets including big datasets like ImageNet and Places365 demonstrate the superior performance of MAD over competing methods.
- Score: 65.28708064066764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-free Knowledge Distillation (DFKD) has attracted attention recently
thanks to its appealing capability of transferring knowledge from a teacher
network to a student network without using training data. The main idea is to
use a generator to synthesize data for training the student. As the generator
gets updated, the distribution of synthetic data will change. Such distribution
shift could be large if the generator and the student are trained
adversarially, causing the student to forget the knowledge it acquired at
previous steps. To alleviate this problem, we propose a simple yet effective
method called Momentum Adversarial Distillation (MAD) which maintains an
exponential moving average (EMA) copy of the generator and uses synthetic
samples from both the generator and the EMA generator to train the student.
Since the EMA generator can be considered as an ensemble of the generator's old
versions and often undergoes a smaller change in updates compared to the
generator, training on its synthetic samples can help the student recall the
past knowledge and prevent the student from adapting too quickly to new updates
of the generator. Our experiments on six benchmark datasets including big
datasets like ImageNet and Places365 demonstrate the superior performance of
MAD over competing methods for handling the large distribution shift problem.
Our method also compares favorably to existing DFKD methods and even achieves
state-of-the-art results in some cases.
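The abstract describes the training loop only at a high level. Below is a minimal PyTorch-style sketch of the idea, assuming a generic adversarial DFKD objective (the generator maximizes teacher-student disagreement, the student minimizes it) and a soft-KL distillation loss; the names and settings here (`soft_kl`, `ema_update`, `mad_step`, the EMA `decay`, equal weighting of the two sample sources) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def soft_kl(student_logits, teacher_logits, T=1.0):
    """KL divergence between softened teacher and student predictions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)


@torch.no_grad()
def ema_update(ema_generator, generator, decay=0.999):
    """Move the EMA generator's weights toward the current generator."""
    for p_ema, p in zip(ema_generator.parameters(), generator.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)


def mad_step(generator, ema_generator, student, teacher,
             opt_g, opt_s, batch_size=128, z_dim=100, device="cuda"):
    """One illustrative training step. The teacher is assumed frozen
    (eval mode, requires_grad disabled on its parameters)."""
    # 1) Adversarial generator update: synthesize samples on which the
    #    student currently disagrees most with the teacher.
    z = torch.randn(batch_size, z_dim, device=device)
    x = generator(z)
    loss_g = -soft_kl(student(x), teacher(x))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # 2) Track the generator with an EMA copy: a smooth ensemble of its
    #    past versions that changes more slowly between updates.
    ema_update(ema_generator, generator)

    # 3) Student update on samples from BOTH generators: fresh samples
    #    follow the newest (possibly shifted) distribution, while EMA
    #    samples help the student recall earlier distributions.
    with torch.no_grad():
        x_new = generator(torch.randn(batch_size, z_dim, device=device))
        x_old = ema_generator(torch.randn(batch_size, z_dim, device=device))
    loss_s = soft_kl(student(x_new), teacher(x_new)) + \
             soft_kl(student(x_old), teacher(x_old))
    opt_s.zero_grad()
    loss_s.backward()
    opt_s.step()
```

In this sketch, `ema_generator` would start as a deep copy of `generator` with gradients disabled; the only additions over a standard adversarial DFKD loop are the EMA update and the extra forward pass through the EMA generator.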
Related papers
- Multi-student Diffusion Distillation for Better One-step Generators [29.751205880199855]
Multi-Student Distillation (MSD) is a framework to distill a conditional teacher diffusion model into multiple single-step generators.
MSD trains multiple distilled students, allowing smaller sizes and, therefore, faster inference.
Using 4 same-sized students, MSD sets a new state-of-the-art for one-step image generation: FID 1.20 on ImageNet-64x64 and 8.20 on zero-shot COCO2014.
arXiv Detail & Related papers (2024-10-30T17:54:56Z)
- Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation [61.03530321578825]
We introduce Score identity Distillation (SiD), an innovative data-free method that distills the generative capabilities of pretrained diffusion models into a single-step generator.
SiD not only facilitates an exponentially fast reduction in Fréchet inception distance (FID) during distillation but also approaches or even exceeds the FID performance of the original teacher diffusion models.
arXiv Detail & Related papers (2024-04-05T12:30:19Z)
- NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation [42.435293471992274]
Data-Free Knowledge Distillation (DFKD) has made significant recent strides by transferring knowledge from a teacher neural network to a student neural network without accessing the original data.
Existing approaches encounter a significant challenge when attempting to generate samples from random noise inputs, which inherently lack meaningful information.
We propose a novel Noisy Layer Generation method (NAYER) which relocates the random source from the input to a noisy layer and utilizes the meaningful constant label-text embedding (LTE) as the input.
arXiv Detail & Related papers (2023-09-30T05:19:10Z)
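The NAYER entry above only names the mechanism; the sketch below illustrates one plausible reading of it: randomness moved from the input to a small re-initializable "noisy layer", with constant label-text embeddings as the input. The layer sizes, shapes, and the `reinit_noise` schedule are assumptions for illustration, not NAYER's actual architecture.

```python
import torch
import torch.nn as nn


class NoisyLayerGenerator(nn.Module):
    """Sketch: the random source is a small re-initializable layer,
    not a noise vector fed at the input."""

    def __init__(self, lte_dim=512, hidden_dim=256, img_shape=(3, 32, 32)):
        super().__init__()
        self.img_shape = img_shape
        self.noisy_layer = nn.Linear(lte_dim, hidden_dim)  # random source
        self.body = nn.Sequential(                         # deterministic body
            nn.Linear(hidden_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, img_shape[0] * img_shape[1] * img_shape[2]),
            nn.Tanh(),
        )

    def reinit_noise(self):
        # Re-randomize only the noisy layer so the same constant inputs
        # yield different samples across generation rounds.
        nn.init.normal_(self.noisy_layer.weight, std=0.1)
        nn.init.zeros_(self.noisy_layer.bias)

    def forward(self, lte):
        # `lte`: constant label-text embeddings (one row per target class),
        # produced offline by a pretrained text encoder (not shown here).
        h = torch.relu(self.noisy_layer(lte))
        return self.body(h).view(-1, *self.img_shape)
```

During data-free distillation, `reinit_noise()` would be called periodically while `forward` is fed the same fixed embeddings, so sample diversity comes from the layer's weights rather than from input noise.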
- Improving Out-of-Distribution Robustness of Classifiers via Generative Interpolation [56.620403243640396]
Deep neural networks achieve superior performance for learning from independent and identically distributed (i.i.d.) data.
However, their performance deteriorates significantly when handling out-of-distribution (OoD) data.
We develop a simple yet effective method called Generative Interpolation to fuse generative models trained from multiple domains for synthesizing diverse OoD samples.
arXiv Detail & Related papers (2023-07-23T03:53:53Z)
- Dynamically Masked Discriminator for Generative Adversarial Networks [71.33631511762782]
Training Generative Adversarial Networks (GANs) remains a challenging problem.
The discriminator trains the generator by learning the distribution of real versus generated data.
We propose a novel method for GANs from the viewpoint of online continual learning.
arXiv Detail & Related papers (2023-06-13T12:07:01Z)
- Dual Discriminator Adversarial Distillation for Data-free Model Compression [36.49964835173507]
We propose Dual Discriminator Adversarial Distillation (DDAD) to distill a neural network without any training data or meta-data.
To be specific, we use a generator to create samples through dual discriminator adversarial distillation, which mimics the original training data.
The proposed method obtains an efficient student network which closely approximates its teacher network, despite using no original training data.
arXiv Detail & Related papers (2021-04-12T12:01:45Z)
- Generative Adversarial Simulator [2.3986080077861787]
We introduce a simulator-free approach to knowledge distillation in the context of reinforcement learning.
A key challenge is having the student learn the multiplicity of cases that correspond to a given action.
This is the first demonstration of simulator-free knowledge distillation between a teacher and a student policy.
arXiv Detail & Related papers (2020-11-23T15:31:12Z)
- Data-Free Network Quantization With Adversarial Knowledge Distillation [39.92282726292386]
In this paper, we consider data-free network quantization with synthetic data.
The synthetic data are generated by a generator, while no original data are used either to train the generator or to perform quantization.
We show the gain of producing diverse adversarial samples by using multiple generators and multiple students.
arXiv Detail & Related papers (2020-05-08T16:24:55Z)
- Data-Free Knowledge Amalgamation via Group-Stack Dual-GAN [80.17705319689139]
We propose a data-free knowledge amalgamation strategy to craft a well-behaved multi-task student network from multiple single-task or multi-task teachers.
Without any training data, the proposed method achieves surprisingly competitive results, even compared with some fully supervised methods.
arXiv Detail & Related papers (2020-03-20T03:20:52Z)
- Distilling portable Generative Adversarial Networks for Image Translation [101.33731583985902]
Traditional network compression methods focus on visual recognition tasks but rarely deal with generation tasks.
Inspired by knowledge distillation, a student generator with fewer parameters is trained by inheriting the low-level and high-level information from the original heavy teacher generator.
An adversarial learning process is established to optimize the student generator and the student discriminator.
arXiv Detail & Related papers (2020-03-07T05:53:01Z)
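The image-translation distillation entry above gives only a high-level description; the sketch below shows one generic way such an objective could be assembled (pixel-level and feature-level matching against the teacher generator, plus an adversarial term from a student discriminator). All names, loss choices, and weights here are assumptions for illustration, not that paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def student_generator_loss(x, teacher_G, student_G, student_D, feat_net,
                           w_pix=1.0, w_feat=1.0, w_adv=0.1):
    """Combine low-level, high-level, and adversarial terms for the student
    generator; the student discriminator is trained separately to tell
    teacher outputs from student outputs."""
    with torch.no_grad():
        y_t = teacher_G(x)          # teacher's translation of input image x
    y_s = student_G(x)              # student's translation of the same input

    # Low-level information: match the teacher's output pixel-wise.
    loss_pix = F.l1_loss(y_s, y_t)

    # High-level information: match features from a fixed feature network
    # (e.g. a pretrained CNN); `feat_net` is a placeholder here.
    loss_feat = F.l1_loss(feat_net(y_s), feat_net(y_t).detach())

    # Adversarial term: the student generator tries to fool the
    # student discriminator.
    logits = student_D(y_s)
    loss_adv = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))

    return w_pix * loss_pix + w_feat * loss_feat + w_adv * loss_adv
```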
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.