Zero-Shot Knowledge Distillation from a Decision-Based Black-Box Model
- URL: http://arxiv.org/abs/2106.03310v1
- Date: Mon, 7 Jun 2021 02:46:31 GMT
- Title: Zero-Shot Knowledge Distillation from a Decision-Based Black-Box Model
- Authors: Zi Wang
- Abstract summary: Knowledge distillation is a successful approach for deep neural network acceleration.
Traditionally, KD relies on access to the training samples and the parameters of the white-box teacher to acquire the transferred knowledge.
Here we propose the concept of decision-based black-box (DB3) knowledge distillation, with which the student is trained by distilling the knowledge from a black-box teacher.
- Score: 8.87104231451079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) is a successful approach for deep neural network
acceleration, with which a compact network (student) is trained by mimicking
the softmax output of a pre-trained high-capacity network (teacher).
Traditionally, KD relies on access to the training samples and the
parameters of the white-box teacher to acquire the transferred knowledge.
However, these prerequisites are not always realistic due to storage costs or
privacy issues in real-world applications. Here we propose the concept of
decision-based black-box (DB3) knowledge distillation, with which the student
is trained by distilling the knowledge from a black-box teacher (parameters are
not accessible) that only returns classes rather than softmax outputs. We start
with the scenario when the training set is accessible. We represent a sample's
robustness against other classes by computing its distances to the teacher's
decision boundaries and use it to construct the soft label for each training
sample. After that, the student can be trained via standard KD. We then extend
this approach to a more challenging scenario in which even accessing the
training data is not feasible. We propose to generate pseudo samples
distinguished by the teacher's decision boundaries to the largest extent and
construct soft labels for them, which are used as the transfer set. We evaluate
our approaches on various benchmark networks and datasets, and the experimental
results demonstrate their effectiveness. Code is available at:
https://github.com/zwang84/zsdb3kd.
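For intuition, here is a minimal NumPy sketch of the first scenario: the distance from a training sample to the teacher's decision boundary against each competing class is estimated by binary search over hard-label queries, and the distances are then turned into a soft label for standard KD. The search direction (towards a reference sample of another class), the query budget, and the softmax-over-negative-distances conversion are illustrative assumptions rather than the paper's exact procedure; the authors' implementation is in the repository linked above.

```python
import numpy as np

def boundary_distance(teacher, x, x_ref, tol=1e-3, max_queries=30):
    """Estimate the distance from x to the teacher's decision boundary along
    the segment towards x_ref, a sample the teacher assigns to another class.
    The teacher is a decision-based black box: it returns only a class index."""
    y = teacher(x)
    direction = x_ref - x
    lo, hi = 0.0, 1.0                      # interpolation coefficients along the segment
    for _ in range(max_queries):
        mid = 0.5 * (lo + hi)
        if teacher(x + mid * direction) == y:
            lo = mid                       # still on the original class's side
        else:
            hi = mid                       # crossed into another class
        if hi - lo < tol:
            break
    return hi * np.linalg.norm(direction)  # approximate distance to the boundary

def soft_label(teacher, x, reference_samples, num_classes, temperature=1.0):
    """Convert boundary distances into a soft label (illustrative choice:
    softmax over negative distances, so closer boundaries receive more mass).
    reference_samples[c] is any sample the teacher labels as class c."""
    y = teacher(x)
    dists = np.full(num_classes, np.inf)
    for c, ref in reference_samples.items():
        if c != y:
            dists[c] = boundary_distance(teacher, x, ref)
    logits = -dists / temperature
    logits[y] = 0.0                        # the teacher's decision keeps the largest logit
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def kd_cross_entropy(student_logits, soft_targets):
    """Standard KD objective once soft labels exist: cross-entropy between
    the constructed soft label and the student's softmax output."""
    z = student_logits - student_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -np.sum(soft_targets * log_probs)
```

In the data-free scenario, the abstract applies the same label construction to pseudo samples: starting from random inputs, the samples are iteratively adjusted, still using only hard-label queries, so that the teacher's decisions on them are separated as much as possible, and the resulting transfer set is distilled with the same standard KD loss.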
Related papers
- Improved knowledge distillation by utilizing backward pass knowledge in neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression.
In this work, we generate new auxiliary training samples by extracting knowledge from the backward pass of the teacher.
We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
arXiv Detail & Related papers (2023-01-27T22:07:38Z)
- Black-box Few-shot Knowledge Distillation [55.27881513982002]
Knowledge distillation (KD) is an efficient approach to transfer the knowledge from a large "teacher" network to a smaller "student" network.
We propose a black-box few-shot KD method to train the student with few unlabeled training samples and a black-box teacher.
We conduct extensive experiments to show that our method significantly outperforms recent SOTA few/zero-shot KD methods on image classification tasks.
arXiv Detail & Related papers (2022-07-25T12:16:53Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data [56.29595334715237]
Knowledge distillation (KD) aims to craft a compact student model that imitates the behavior of a pre-trained teacher in a target domain.
We introduce a handy yet surprisingly efficacious approach, dubbed MosaicKD.
In MosaicKD, this is achieved through a four-player min-max game, in which a generator, a discriminator, and a student network are collectively trained in an adversarial manner.
arXiv Detail & Related papers (2021-10-27T13:01:10Z)
- Beyond Classification: Knowledge Distillation using Multi-Object Impressions [17.214664783818687]
Knowledge Distillation (KD) utilizes training data as a transfer set to transfer knowledge from a complex network (Teacher) to a smaller network (Student).
Several works have recently identified many scenarios where the training data may not be available due to data privacy or sensitivity concerns.
We, for the first time, solve a much more challenging problem, i.e., "KD for object detection with zero knowledge about the training data and its statistics".
arXiv Detail & Related papers (2021-10-27T06:59:27Z)
- Efficient training of lightweight neural networks using Online Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize the k-NN non-parametric density estimation technique to estimate the unknown probability distributions of the data samples in the output feature space (a generic sketch of this estimator appears after this list).
arXiv Detail & Related papers (2021-08-26T14:01:04Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of any student that learns by imitating it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Towards Zero-Shot Knowledge Distillation for Natural Language Processing [9.223848704267088]
Knowledge Distillation (KD) is a common algorithm used for model compression across a variety of deep learning based natural language processing (NLP) solutions.
In its regular manifestations, KD requires access to the teacher's training data for knowledge transfer to the student network.
We present, to the best of our knowledge, the first work on Zero-Shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task-specific data.
arXiv Detail & Related papers (2020-12-31T08:16:29Z)
- Progressive Network Grafting for Few-Shot Knowledge Distillation [60.38608462158474]
We introduce a principled dual-stage distillation scheme tailored for few-shot data.
In the first step, we graft the student blocks one by one onto the teacher, and learn the parameters of the grafted block intertwined with those of the other teacher blocks.
Experiments demonstrate that our approach, with only a few unlabeled samples, achieves gratifying results on CIFAR10, CIFAR100, and ILSVRC-2012.
arXiv Detail & Related papers (2020-12-09T08:34:36Z)
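As a side note to the OSAKD entry above, the sketch below shows the generic k-NN non-parametric density estimator its summary refers to, applied to vectors in an output feature space. The formula p(x) ≈ k / (n · V_d · r_k(x)^d) is the textbook k-NN estimator; how OSAKD actually uses the resulting densities is not covered by the summary, so treat this as background rather than the paper's method.

```python
import numpy as np
from math import gamma, pi

def knn_density(features, queries, k=5):
    """Textbook k-NN density estimate: p(x) ~ k / (n * V_d * r_k(x)^d), where
    r_k(x) is the distance from x to its k-th nearest neighbour among the n
    reference feature vectors and V_d is the volume of the d-dimensional unit ball."""
    n, d = features.shape
    unit_ball_volume = pi ** (d / 2) / gamma(d / 2 + 1)
    estimates = []
    for q in queries:
        dists = np.sort(np.linalg.norm(features - q, axis=1))
        r_k = dists[k - 1]                 # distance to the k-th nearest neighbour
        estimates.append(k / (n * unit_ball_volume * r_k ** d))
    return np.array(estimates)
```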
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.