Momentum Adversarial Distillation: Handling Large Distribution Shifts in
Data-Free Knowledge Distillation
- URL: http://arxiv.org/abs/2209.10359v1
- Date: Wed, 21 Sep 2022 13:53:56 GMT
- Title: Momentum Adversarial Distillation: Handling Large Distribution Shifts in
Data-Free Knowledge Distillation
- Authors: Kien Do, Hung Le, Dung Nguyen, Dang Nguyen, Haripriya Harikumar,
Truyen Tran, Santu Rana, Svetha Venkatesh
- Abstract summary: We propose a simple yet effective method called Momentum Adversarial Distillation (MAD)
MAD maintains an exponential moving average (EMA) copy of the generator and uses synthetic samples from both the generator and the EMA generator to train the student.
Our experiments on six benchmark datasets including big datasets like ImageNet and Places365 demonstrate the superior performance of MAD over competing methods.
- Score: 65.28708064066764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-free Knowledge Distillation (DFKD) has attracted attention recently
thanks to its appealing capability of transferring knowledge from a teacher
network to a student network without using training data. The main idea is to
use a generator to synthesize data for training the student. As the generator
gets updated, the distribution of synthetic data will change. Such distribution
shift could be large if the generator and the student are trained
adversarially, causing the student to forget the knowledge it acquired at
previous steps. To alleviate this problem, we propose a simple yet effective
method called Momentum Adversarial Distillation (MAD) which maintains an
exponential moving average (EMA) copy of the generator and uses synthetic
samples from both the generator and the EMA generator to train the student.
Since the EMA generator can be considered as an ensemble of the generator's old
versions and often undergoes a smaller change in updates compared to the
generator, training on its synthetic samples can help the student recall the
past knowledge and prevent the student from adapting too quickly to new updates
of the generator. Our experiments on six benchmark datasets including big
datasets like ImageNet and Places365 demonstrate the superior performance of
MAD over competing methods for handling the large distribution shift problem.
Our method also compares favorably to existing DFKD methods and even achieves
state-of-the-art results in some cases.
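The abstract describes the training loop only at a high level. Below is a minimal PyTorch-style sketch of the idea, assuming a generic adversarial DFKD objective (the generator maximizes teacher-student disagreement, the student minimizes it) and a soft-KL distillation loss; the names and settings here (`soft_kl`, `ema_update`, `mad_step`, the EMA `decay`, equal weighting of the two sample sources) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def soft_kl(student_logits, teacher_logits, T=1.0):
    """KL divergence between softened teacher and student predictions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)


@torch.no_grad()
def ema_update(ema_generator, generator, decay=0.999):
    """Move the EMA generator's weights toward the current generator."""
    for p_ema, p in zip(ema_generator.parameters(), generator.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)


def mad_step(generator, ema_generator, student, teacher,
             opt_g, opt_s, batch_size=128, z_dim=100, device="cuda"):
    """One illustrative training step. The teacher is assumed frozen
    (eval mode, requires_grad disabled on its parameters)."""
    # 1) Adversarial generator update: synthesize samples on which the
    #    student currently disagrees most with the teacher.
    z = torch.randn(batch_size, z_dim, device=device)
    x = generator(z)
    loss_g = -soft_kl(student(x), teacher(x))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # 2) Track the generator with an EMA copy: a smooth ensemble of its
    #    past versions that changes more slowly between updates.
    ema_update(ema_generator, generator)

    # 3) Student update on samples from BOTH generators: fresh samples
    #    follow the newest (possibly shifted) distribution, while EMA
    #    samples help the student recall earlier distributions.
    with torch.no_grad():
        x_new = generator(torch.randn(batch_size, z_dim, device=device))
        x_old = ema_generator(torch.randn(batch_size, z_dim, device=device))
    loss_s = soft_kl(student(x_new), teacher(x_new)) + \
             soft_kl(student(x_old), teacher(x_old))
    opt_s.zero_grad()
    loss_s.backward()
    opt_s.step()
```

In this sketch, `ema_generator` would start as a deep copy of `generator` with gradients disabled; the only additions over a standard adversarial DFKD loop are the EMA update and the extra forward pass through the EMA generator.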
Related papers
- Multi-student Diffusion Distillation for Better One-step Generators [29.751205880199855]
Multi-Student Distillation (MSD) is a framework to distill a conditional teacher diffusion model into multiple single-step generators.
MSD trains multiple distilled students, allowing smaller sizes and, therefore, faster inference.
Using 4 same-sized students, MSD sets a new state-of-the-art for one-step image generation: FID 1.20 on ImageNet-64x64 and 8.20 on zero-shot COCO2014.
arXiv Detail & Related papers (2024-10-30T17:54:56Z)
- Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation [61.03530321578825]
We introduce Score identity Distillation (SiD), an innovative data-free method that distills the generative capabilities of pretrained diffusion models into a single-step generator.
SiD not only facilitates an exponentially fast reduction in Fréchet inception distance (FID) during distillation but also approaches or even exceeds the FID performance of the original teacher diffusion models.
arXiv Detail & Related papers (2024-04-05T12:30:19Z)
- NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation [42.435293471992274]
Data-Free Knowledge Distillation (DFKD) has made significant recent strides by transferring knowledge from a teacher neural network to a student neural network without accessing the original data.
Existing approaches encounter a significant challenge when attempting to generate samples from random noise inputs, which inherently lack meaningful information.
We propose a novel Noisy Layer Generation method (NAYER) which relocates the random source from the input to a noisy layer and utilizes the meaningful constant label-text embedding (LTE) as the input.
arXiv Detail & Related papers (2023-09-30T05:19:10Z)
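The NAYER entry above only names the mechanism; the sketch below illustrates one plausible reading of it: randomness moved from the input to a small re-initializable "noisy layer", with constant label-text embeddings as the input. The layer sizes, shapes, and the `reinit_noise` schedule are assumptions for illustration, not NAYER's actual architecture.

```python
import torch
import torch.nn as nn


class NoisyLayerGenerator(nn.Module):
    """Sketch: the random source is a small re-initializable layer,
    not a noise vector fed at the input."""

    def __init__(self, lte_dim=512, hidden_dim=256, img_shape=(3, 32, 32)):
        super().__init__()
        self.img_shape = img_shape
        self.noisy_layer = nn.Linear(lte_dim, hidden_dim)  # random source
        self.body = nn.Sequential(                         # deterministic body
            nn.Linear(hidden_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, img_shape[0] * img_shape[1] * img_shape[2]),
            nn.Tanh(),
        )

    def reinit_noise(self):
        # Re-randomize only the noisy layer so the same constant inputs
        # yield different samples across generation rounds.
        nn.init.normal_(self.noisy_layer.weight, std=0.1)
        nn.init.zeros_(self.noisy_layer.bias)

    def forward(self, lte):
        # `lte`: constant label-text embeddings (one row per target class),
        # produced offline by a pretrained text encoder (not shown here).
        h = torch.relu(self.noisy_layer(lte))
        return self.body(h).view(-1, *self.img_shape)
```

During data-free distillation, `reinit_noise()` would be called periodically while `forward` is fed the same fixed embeddings, so sample diversity comes from the layer's weights rather than from input noise.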
- Improving Out-of-Distribution Robustness of Classifiers via Generative Interpolation [56.620403243640396]
Deep neural networks achieve superior performance for learning from independent and identically distributed (i.i.d.) data.
However, their performance deteriorates significantly when handling out-of-distribution (OoD) data.
We develop a simple yet effective method called Generative Interpolation to fuse generative models trained from multiple domains for synthesizing diverse OoD samples.
arXiv Detail & Related papers (2023-07-23T03:53:53Z)
- Dynamically Masked Discriminator for Generative Adversarial Networks [71.33631511762782]
Training Generative Adversarial Networks (GANs) remains a challenging problem.
The discriminator trains the generator by learning the distribution of real versus generated data.
We propose a novel method for GANs from the viewpoint of online continual learning.
arXiv Detail & Related papers (2023-06-13T12:07:01Z)
- Dual Discriminator Adversarial Distillation for Data-free Model Compression [36.49964835173507]
We propose Dual Discriminator Adversarial Distillation (DDAD) to distill a neural network without any training data or meta-data.
To be specific, we use a generator to create samples through dual discriminator adversarial distillation, which mimics the original training data.
The proposed method obtains an efficient student network which closely approximates its teacher network, despite using no original training data.
arXiv Detail & Related papers (2021-04-12T12:01:45Z)
- Generative Adversarial Simulator [2.3986080077861787]
We introduce a simulator-free approach to knowledge distillation in the context of reinforcement learning.
A key challenge is having the student learn the multiplicity of cases that correspond to a given action.
This is the first demonstration of simulator-free knowledge distillation between a teacher and a student policy.
arXiv Detail & Related papers (2020-11-23T15:31:12Z)
- Data-Free Network Quantization With Adversarial Knowledge Distillation [39.92282726292386]
In this paper, we consider data-free network quantization with synthetic data.
The synthetic data are generated by a generator, while no original data are used either to train the generator or to perform quantization.
We show the gain of producing diverse adversarial samples by using multiple generators and multiple students.
arXiv Detail & Related papers (2020-05-08T16:24:55Z)
- Data-Free Knowledge Amalgamation via Group-Stack Dual-GAN [80.17705319689139]
We propose a data-free knowledge amalgamation strategy to craft a well-behaved multi-task student network from multiple single-task or multi-task teachers.
Without any training data, the proposed method achieves surprisingly competitive results, even compared with some fully supervised methods.
arXiv Detail & Related papers (2020-03-20T03:20:52Z)
- Distilling portable Generative Adversarial Networks for Image Translation [101.33731583985902]
Traditional network compression methods focus on visual recognition tasks but rarely deal with generation tasks.
Inspired by knowledge distillation, a student generator with fewer parameters is trained by inheriting the low-level and high-level information from the original heavy teacher generator.
An adversarial learning process is established to optimize the student generator and the student discriminator.
arXiv Detail & Related papers (2020-03-07T05:53:01Z)
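The image-translation distillation entry above gives only a high-level description; the sketch below shows one generic way such an objective could be assembled (pixel-level and feature-level matching against the teacher generator, plus an adversarial term from a student discriminator). All names, loss choices, and weights here are assumptions for illustration, not that paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def student_generator_loss(x, teacher_G, student_G, student_D, feat_net,
                           w_pix=1.0, w_feat=1.0, w_adv=0.1):
    """Combine low-level, high-level, and adversarial terms for the student
    generator; the student discriminator is trained separately to tell
    teacher outputs from student outputs."""
    with torch.no_grad():
        y_t = teacher_G(x)          # teacher's translation of input image x
    y_s = student_G(x)              # student's translation of the same input

    # Low-level information: match the teacher's output pixel-wise.
    loss_pix = F.l1_loss(y_s, y_t)

    # High-level information: match features from a fixed feature network
    # (e.g. a pretrained CNN); `feat_net` is a placeholder here.
    loss_feat = F.l1_loss(feat_net(y_s), feat_net(y_t).detach())

    # Adversarial term: the student generator tries to fool the
    # student discriminator.
    logits = student_D(y_s)
    loss_adv = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))

    return w_pix * loss_pix + w_feat * loss_feat + w_adv * loss_adv
```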
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.