Students are the Best Teacher: Exit-Ensemble Distillation with
Multi-Exits
- URL: http://arxiv.org/abs/2104.00299v2
- Date: Mon, 5 Apr 2021 01:13:20 GMT
- Title: Students are the Best Teacher: Exit-Ensemble Distillation with
Multi-Exits
- Authors: Hojung Lee, Jong-Seok Lee
- Abstract summary: This paper proposes a novel knowledge distillation-based learning method to improve the classification performance of convolutional neural networks (CNNs).
Unlike the conventional notion of distillation, where teachers only teach students, we show that students can also help other students, and even the teacher, to learn better.
- Score: 25.140055086630838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a novel knowledge distillation-based learning method to
improve the classification performance of convolutional neural networks (CNNs)
without a pre-trained teacher network, called exit-ensemble distillation. Our
method exploits the multi-exit architecture that adds auxiliary classifiers
(called exits) in the middle of a conventional CNN, through which early
inference results can be obtained. The idea of our method is to train the
network using the ensemble of the exits as the distillation target, which
greatly improves the classification performance of the overall network. Our
method suggests a new paradigm of knowledge distillation; unlike the
conventional notion of distillation where teachers only teach students, we show
that students can also help other students and even the teacher to learn
better. Experimental results demonstrate that our method achieves significant
improvement of classification performance on various popular CNN architectures
(VGG, ResNet, ResNeXt, WideResNet, etc.). Furthermore, the proposed method can
expedite the convergence of learning with improved stability. Our code will be
available on Github.
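The core idea, training each exit against the ensemble of all exits' softened predictions, can be sketched as follows. This is a minimal NumPy illustration of the distillation term only, not the authors' implementation; the temperature T, the direction of the KL divergence, and the equal weighting of exits are assumptions made for illustration, and in practice this term would be combined with a cross-entropy loss on the ground-truth labels at every exit.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def exit_ensemble_kd_loss(exit_logits, T=3.0):
    """Distillation term that pulls every exit (including the final
    classifier) toward the averaged soft predictions of all exits.

    exit_logits: list of (batch, classes) arrays, one per exit.
    """
    probs = [softmax(z, T) for z in exit_logits]
    ensemble = np.mean(probs, axis=0)  # the exit-ensemble target
    # KL(ensemble || exit_k), averaged over the batch and summed over
    # exits; the T^2 factor is the usual KD gradient rescaling.
    kl_terms = [
        np.sum(ensemble * (np.log(ensemble + 1e-12) - np.log(p + 1e-12)),
               axis=-1).mean()
        for p in probs
    ]
    return (T ** 2) * sum(kl_terms)

# Toy usage: 3 exits, batch of 2, 4 classes.
rng = np.random.default_rng(0)
logits = [rng.normal(size=(2, 4)) for _ in range(3)]
loss = exit_ensemble_kd_loss(logits)
```

Because the target is the ensemble rather than a fixed pre-trained teacher, the gradient flows to all exits at once, which is how the "students teach the teacher" effect arises.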
Related papers
- AICSD: Adaptive Inter-Class Similarity Distillation for Semantic
Segmentation [12.92102548320001]
This paper proposes a novel method called Adaptive Inter-Class Similarity Distillation (AICSD) for the purpose of knowledge distillation.
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z) - On effects of Knowledge Distillation on Transfer Learning [0.0]
We propose a machine learning architecture we call TL+KD that combines knowledge distillation with transfer learning.
We show that using guidance and knowledge from a larger teacher network during fine-tuning, we can improve the student network to achieve better validation performance, such as higher accuracy.
arXiv Detail & Related papers (2022-10-18T08:11:52Z) - Meta Learning for Knowledge Distillation [12.716258111815312]
We show the teacher network can learn to better transfer knowledge to the student network.
We introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner.
arXiv Detail & Related papers (2021-06-08T17:59:03Z) - Distilling Knowledge via Knowledge Review [69.15050871776552]
We study cross-level connection paths between teacher and student networks, and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
The resulting nested and compact framework incurs negligible overhead, and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z) - Knowledge Distillation By Sparse Representation Matching [107.87219371697063]
We propose Sparse Representation Matching (SRM) to transfer intermediate knowledge from one Convolutional Network (CNN) to another by utilizing sparse representation.
We formulate SRM as a neural processing block, which can be efficiently optimized using gradient descent and integrated into any CNN in a plug-and-play manner.
Our experiments demonstrate that SRM is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets.
arXiv Detail & Related papers (2021-03-31T11:47:47Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge
Distillation [12.097302014936655]
This paper proposes a novel self-knowledge distillation method, Feature Refinement via Self-Knowledge Distillation (FRSKD).
Our proposed method, FRSKD, can utilize both soft-label and feature-map distillation for the self-knowledge distillation.
We demonstrate the effectiveness of FRSKD by enumerating its performance improvements in diverse tasks and benchmark datasets.
arXiv Detail & Related papers (2021-03-15T10:59:43Z) - Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural
Architecture Search [60.965024145243596]
One-shot weight sharing methods have recently drawn great attention in neural architecture search due to high efficiency and competitive performance.
To alleviate the problem that weight sharing can degrade the ranking of candidate architectures, we present a simple yet effective architecture distillation method.
We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training.
Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop.
arXiv Detail & Related papers (2020-10-29T17:55:05Z) - Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z) - Distilling Knowledge from Graph Convolutional Networks [146.71503336770886]
Existing knowledge distillation methods focus on convolutional neural networks (CNNs).
We propose the first dedicated approach to distilling knowledge from a pre-trained graph convolutional network (GCN) model.
We show that our method achieves the state-of-the-art knowledge distillation performance for GCN models.
arXiv Detail & Related papers (2020-03-23T18:23:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all listed content) and is not responsible for any consequences of its use.