Subclass Distillation
- URL: http://arxiv.org/abs/2002.03936v2
- Date: Wed, 10 Jun 2020 18:32:14 GMT
- Title: Subclass Distillation
- Authors: Rafael Müller, Simon Kornblith, Geoffrey Hinton
- Abstract summary: We show that it is possible to transfer most of the generalization ability of a teacher to a student.
For datasets where there are known, natural subclasses, we demonstrate that the teacher learns similar subclasses.
For clickthrough datasets where the subclasses are unknown, we demonstrate that subclass distillation allows the student to learn faster and better.
- Score: 94.18870689772544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: After a large "teacher" neural network has been trained on labeled data, the
probabilities that the teacher assigns to incorrect classes reveal a lot of
information about the way in which the teacher generalizes. By training a small
"student" model to match these probabilities, it is possible to transfer most
of the generalization ability of the teacher to the student, often producing a
much better small model than directly training the student on the training
data. The transfer works best when there are many possible classes because more
is then revealed about the function learned by the teacher, but in cases where
there are only a few possible classes we show that we can improve the transfer
by forcing the teacher to divide each class into many subclasses that it
invents during the supervised training. The student is then trained to match
the subclass probabilities. For datasets where there are known, natural
subclasses we demonstrate that the teacher learns similar subclasses and these
improve distillation. For clickthrough datasets where the subclasses are
unknown we demonstrate that subclass distillation allows the student to learn
faster and better.
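Below is a minimal sketch of the two losses the abstract describes, written in plain NumPy with hypothetical names (teacher_logits, student_logits, num_classes, num_subclasses, temperature); it is not the authors' implementation, and it omits any auxiliary term used to keep the teacher's invented subclasses distinct. Both networks emit num_classes * num_subclasses logits: the student is trained to match the teacher's subclass probabilities, while class probabilities for the ordinary supervised loss are obtained by summing the subclass probabilities within each class.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def subclass_distillation_losses(teacher_logits, student_logits, labels,
                                 num_classes, num_subclasses, temperature=2.0):
    """Sketch: both networks output num_classes * num_subclasses logits,
    arranged class-major (all subclasses of class 0, then class 1, ...)."""
    # Soft targets: the teacher's subclass distribution at a raised temperature.
    t_sub = softmax(teacher_logits / temperature)   # (batch, C*S)
    s_sub = softmax(student_logits / temperature)   # (batch, C*S)

    # Distillation term: cross-entropy between teacher and student subclass
    # distributions, i.e. the student matches the subclass probabilities.
    distill_loss = -np.mean(np.sum(t_sub * np.log(s_sub + 1e-12), axis=1))

    # Supervised term: collapse the student's subclass probabilities into
    # class probabilities and apply ordinary cross-entropy on the true label.
    s_class = softmax(student_logits).reshape(-1, num_classes, num_subclasses).sum(axis=2)
    ce_loss = -np.mean(np.log(s_class[np.arange(len(labels)), labels] + 1e-12))
    return distill_loss, ce_loss
```

In training, the student would minimize a weighted sum of the two terms, with the distillation term typically rescaled by the square of the temperature, as in standard distillation; the teacher is trained beforehand with the same collapse-to-class cross-entropy.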
Related papers
- Subclass Knowledge Distillation with Known Subclass Labels [28.182027210008656]
Subclass Knowledge Distillation (SKD) is a process of transferring the knowledge of predicted subclasses from a teacher to a smaller student.
A lightweight, low-complexity student trained with the SKD framework achieves an F1-score of 85.05%, a gain of 1.47% and 2.10% over the students trained with and without conventional knowledge distillation, respectively.
arXiv Detail & Related papers (2022-07-17T03:14:05Z) - Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow the knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z) - Multi-Teacher Knowledge Distillation for Incremental Implicitly-Refined Classification [37.14755431285735]
We propose a novel Multi-Teacher Knowledge Distillation (MTKD) strategy for incremental learning.
To preserve the superclass knowledge, we use the initial model as a superclass teacher to distill the superclass knowledge for the student model.
We propose a post-processing mechanism, called Top-k prediction restriction, to reduce redundant predictions.
arXiv Detail & Related papers (2022-02-23T09:51:40Z) - Long-tail Recognition via Compositional Knowledge Transfer [60.03764547406601]
We introduce a novel strategy for long-tail recognition that addresses the tail classes' few-shot problem.
Our objective is to transfer knowledge acquired from information-rich common classes to semantically similar, and yet data-hungry, rare classes.
Experiments show that our approach can achieve significant performance boosts on rare classes while maintaining robust common class performance.
arXiv Detail & Related papers (2021-12-13T15:48:59Z) - On the Efficiency of Subclass Knowledge Distillation in Classification Tasks [33.1278647424578]
The Subclass Knowledge Distillation (SKD) framework is a process of transferring the subclasses' prediction knowledge from a large teacher model into a smaller student one.
The framework is evaluated in a clinical application, namely colorectal polyp binary classification.
A lightweight, low-complexity student trained with the proposed framework achieves an F1-score of 85.05%, a gain of 2.14% and 1.49% over the students trained without and with conventional knowledge distillation, respectively.
arXiv Detail & Related papers (2021-09-12T19:04:44Z) - Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z) - Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z) - Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach that trains a resource-limited student model under the guidance of a pre-trained, larger teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z) - Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)