Teaching What You Should Teach: A Data-Based Distillation Method
- URL: http://arxiv.org/abs/2212.05422v6
- Date: Sat, 20 May 2023 11:55:10 GMT
- Title: Teaching What You Should Teach: A Data-Based Distillation Method
- Authors: Shitong Shao and Huanran Chen and Zhen Huang and Linrui Gong and Shuai Wang and Xinxiao Wu
- Abstract summary: We introduce the "Teaching what you Should Teach" strategy into a knowledge distillation framework.
We propose a data-based distillation method named "TST" that searches for desirable augmented samples to assist in distilling more efficiently and rationally.
Specifically, we design a neural network-based data augmentation module with a priori bias, which helps find samples that match the teacher's strengths and the student's weaknesses.
- Score: 20.595460553747163
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In real teaching scenarios, an excellent teacher always teaches what he (or she) is good at but the student is not; this gives the student the best assistance in making up for his (or her) weaknesses and becoming well-rounded overall. Enlightened by this, we introduce the "Teaching what you Should Teach"
strategy into a knowledge distillation framework, and propose a data-based
distillation method named "TST" that searches for desirable augmented samples
to assist in distilling more efficiently and rationally. Specifically, we design a neural network-based data augmentation module with a priori bias, which learns augmentation magnitudes and probabilities to generate samples that match the teacher's strengths and the student's weaknesses. By training the data augmentation module and the generalized distillation paradigm in turn, a student model with excellent generalization ability is learned. To verify the effectiveness of our method, we conducted
extensive comparative experiments on object recognition, detection, and
segmentation tasks. The results on the CIFAR-10, ImageNet-1k, MS-COCO, and
Cityscapes datasets demonstrate that our method achieves state-of-the-art
performance on almost all teacher-student pairs. Furthermore, we conduct
visualization studies to explore what magnitudes and probabilities are needed
for the distillation process.
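As a rough illustration of the alternating scheme described in the abstract, the following PyTorch sketch trains a tiny augmentation module (a single learnable magnitude and apply-probability driving additive noise) against a frozen teacher and a student, then distills the student on the searched samples. The module design, loss signs, and weighting are assumptions made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of TST-style alternating training (not the authors' code).
import torch
import torch.nn.functional as F

class LearnableAugment(torch.nn.Module):
    """Differentiable augmentation with a learnable magnitude and probability."""
    def __init__(self):
        super().__init__()
        self.magnitude = torch.nn.Parameter(torch.tensor(0.1))   # perturbation strength
        self.prob_logit = torch.nn.Parameter(torch.tensor(0.0))  # pre-sigmoid apply-probability

    def forward(self, x):
        p = torch.sigmoid(self.prob_logit)
        return x + p * self.magnitude * torch.randn_like(x)      # soft, differentiable "apply"

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard temperature-scaled KL distillation loss."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T

def tst_step(teacher, student, augment, x, y, opt_student, opt_augment):
    # Teacher parameters are assumed frozen (requires_grad=False).
    # (1) Update the augmenter: search for samples the teacher handles well
    #     (low teacher cross-entropy) but on which the student diverges from it.
    x_aug = augment(x)
    t_logits, s_logits = teacher(x_aug), student(x_aug)
    loss_aug = F.cross_entropy(t_logits, y) - kd_loss(s_logits, t_logits)
    opt_augment.zero_grad(); loss_aug.backward(); opt_augment.step()

    # (2) Update the student: distill on the searched samples.
    with torch.no_grad():
        x_aug = augment(x)
        t_logits = teacher(x_aug)
    s_logits = student(x_aug)
    loss_student = kd_loss(s_logits, t_logits) + F.cross_entropy(s_logits, y)
    opt_student.zero_grad(); loss_student.backward(); opt_student.step()
```

In practice the augmentation module would parameterize a pool of image operations rather than plain additive noise, and the two phases would alternate over whole epochs rather than within a single step.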
Related papers
- ReffAKD: Resource-efficient Autoencoder-based Knowledge Distillation [3.301728339780329]
We propose an innovative method to boost Knowledge Distillation efficiency without the need for resource-heavy teacher models.
In our work, we propose an efficient method for generating soft labels, thereby eliminating the need for a large teacher model.
Our experiments on various datasets, including CIFAR-100, Tiny Imagenet, and Fashion MNIST, demonstrate the superior resource efficiency of our approach.
arXiv Detail & Related papers (2024-04-15T15:54:30Z)
- Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval [57.17075479691486]
We propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval.
Our source code is released at https://github.com/Maryeon/whiten_mtd.
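A hedged sketch of one way to read the whitening idea: whiten each teacher's retrieval embeddings so their statistics are comparable, then average their similarity matrices into a single distillation target for the student. The function names and the ZCA/averaging choices are assumptions; Whiten-MTD's actual objectives may differ.

```python
# Illustrative "whiten then combine" target for multi-teacher retrieval distillation.
import torch
import torch.nn.functional as F

def zca_whiten(feats, eps=1e-5):
    # feats: [N, dim] embeddings from one teacher; zero-mean and decorrelate them.
    feats = feats - feats.mean(dim=0, keepdim=True)
    cov = feats.t() @ feats / (feats.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    zca = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.t()
    return feats @ zca

def multi_teacher_similarity_target(teacher_feats):
    # Average the whitened teachers' similarity matrices into one distillation target.
    sims = []
    for f in teacher_feats:                          # list of [N, dim] tensors
        w = F.normalize(zca_whiten(f), dim=1)
        sims.append(w @ w.t())
    return torch.stack(sims).mean(dim=0)
```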
arXiv Detail & Related papers (2023-12-15T11:43:56Z)
- Student-friendly Knowledge Distillation [1.5469452301122173]
We propose student-friendly knowledge distillation (SKD) to simplify teacher output into new knowledge representations.
SKD contains a softening processing step and a learning simplifier.
The experimental results on the CIFAR-100 and ImageNet datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-05-18T11:44:30Z)
- Improved knowledge distillation by utilizing backward pass knowledge in neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression.
In this work, we generate new auxiliary training samples based on extracting knowledge from the backward pass of the teacher.
We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
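One generic way to extract knowledge from the teacher's backward pass is to perturb inputs along the input-gradient of the teacher's loss, yielding auxiliary samples near the decision boundary on which the student is then distilled. The sketch below illustrates that reading; the step size and sign rule are assumptions, not the paper's exact recipe.

```python
# Illustrative auxiliary-sample generation from the teacher's backward pass.
import torch
import torch.nn.functional as F

def backward_pass_auxiliary(teacher, x, y, step=0.01):
    """Perturb inputs along the input-gradient of the teacher's loss to obtain
    auxiliary samples near the decision boundary (an illustrative assumption)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(teacher(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + step * grad.sign()).detach()   # distill teacher outputs on these too
```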
arXiv Detail & Related papers (2023-01-27T22:07:38Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
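A minimal sketch of distilling "relative geometry" in retrieval: align the student's query-document score distribution with the teacher's. This is an illustrative reading of the summary; EmbedDistill's actual losses (and its handling of DE and CE teachers) are not reproduced here.

```python
# Illustrative geometry-matching loss over query-document scores.
import torch
import torch.nn.functional as F

def geometry_distill_loss(q_s, d_s, q_t, d_t, T=1.0):
    # q_*, d_*: [B, dim] query / document embeddings from student (s) and teacher (t).
    scores_s = (q_s @ d_s.t()) / T           # student's query-document similarities
    scores_t = (q_t @ d_t.t()) / T           # teacher's query-document similarities
    return F.kl_div(F.log_softmax(scores_s, dim=1),
                    F.softmax(scores_t, dim=1),
                    reduction="batchmean")
```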
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
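A common way to combine adversarial robustness with distillation, shown as a hedged sketch (not necessarily the exact AKD objective): craft a one-step FGSM perturbation against the student, then pull the student toward the teacher's soft outputs on the perturbed input.

```python
# Illustrative adversarial distillation loss (FGSM perturbation + soft-label matching).
import torch
import torch.nn.functional as F

def adversarial_kd_loss(teacher, student, x, y, eps=8 / 255, T=4.0):
    # Craft a one-step perturbation that increases the student's loss.
    x_adv = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(student(x_adv), y), x_adv)[0]
    x_adv = (x_adv + eps * grad.sign()).clamp(0, 1).detach()

    # Match the teacher's soft outputs on the perturbed input.
    with torch.no_grad():
        t_logits = teacher(x_adv)
    s_logits = student(x_adv)
    return F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * T * T
```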
arXiv Detail & Related papers (2022-03-14T15:02:13Z)
- Dual Discriminator Adversarial Distillation for Data-free Model Compression [36.49964835173507]
We propose Dual Discriminator Adversarial Distillation (DDAD) to distill a neural network without any training data or meta-data.
To be specific, we use a generator to create samples through dual discriminator adversarial distillation, which mimics the original training data.
The proposed method obtains an efficient student network which closely approximates its teacher network, despite using no original training data.
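A minimal data-free distillation loop in this spirit, as a sketch: a generator synthesizes inputs that maximize student-teacher disagreement, and the student then imitates the teacher on freshly generated samples. The dual-discriminator components of DDAD are not reproduced; the L1 disagreement measure and loop structure are assumptions.

```python
# Illustrative data-free distillation step (teacher assumed frozen).
import torch
import torch.nn.functional as F

def data_free_step(generator, teacher, student, opt_g, opt_s, batch=64, z_dim=100):
    # Generator step: synthesize inputs on which student and teacher disagree most.
    z = torch.randn(batch, z_dim)
    x = generator(z)
    disagreement = F.l1_loss(student(x), teacher(x))
    opt_g.zero_grad(); (-disagreement).backward(); opt_g.step()

    # Student step: imitate the teacher on freshly generated samples.
    with torch.no_grad():
        x = generator(torch.randn(batch, z_dim))
        t_logits = teacher(x)
    loss_s = F.l1_loss(student(x), t_logits)
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```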
arXiv Detail & Related papers (2021-04-12T12:01:45Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
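A hedged sketch of transferring self-supervision signals: compare the teacher's and student's pairwise similarity structures over (augmented) views and penalize the mismatch. The temperature, diagonal masking, and KL formulation here are illustrative choices rather than the paper's exact loss.

```python
# Illustrative auxiliary loss matching teacher and student similarity structures.
import torch
import torch.nn.functional as F

def self_supervised_kd_loss(f_s, f_t, T=0.5):
    # f_s, f_t: [N, dim] student / teacher features of the same augmented views.
    f_s, f_t = F.normalize(f_s, dim=1), F.normalize(f_t, dim=1)
    sim_s = (f_s @ f_s.t()) / T              # student's pairwise similarity structure
    sim_t = (f_t @ f_t.t()) / T              # teacher's pairwise similarity structure
    mask = torch.eye(f_s.shape[0], dtype=torch.bool, device=f_s.device)
    sim_s = sim_s.masked_fill(mask, -1e9)    # ignore trivial self-similarity
    sim_t = sim_t.masked_fill(mask, -1e9)
    return F.kl_div(F.log_softmax(sim_s, dim=1),
                    F.softmax(sim_t, dim=1),
                    reduction="batchmean")
```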
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)