Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation
- URL: http://arxiv.org/abs/2110.00329v1
- Date: Fri, 1 Oct 2021 11:46:12 GMT
- Title: Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation
- Authors: Zheng Li, Xiang Li, Lingfeng Zhang, Jian Yang, Zhigeng Pan
- Abstract summary: We propose a novel student-helping-teacher formula, Teacher Evolution via Self-Knowledge Distillation (TESKD), where the target teacher is learned with the help of multiple hierarchical students by sharing the structural backbone.
The effectiveness of our proposed framework is demonstrated by extensive experiments with various network settings on two standard benchmarks including CIFAR-100 and ImageNet.
- Score: 20.17325172100031
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation usually transfers the knowledge from a pre-trained
cumbersome teacher network to a compact student network, which follows the
classical teacher-teaching-student paradigm. Based on this paradigm, previous
methods mostly focus on how to efficiently train a better student network for
deployment. Different from the existing practices, in this paper, we propose a
novel student-helping-teacher formula, Teacher Evolution via Self-Knowledge
Distillation (TESKD), where the target teacher (for deployment) is learned with
the help of multiple hierarchical students by sharing the structural backbone.
The diverse feedback from multiple students allows the teacher to improve
itself through the shared feature representations. The effectiveness of our
proposed framework is demonstrated by extensive experiments with various
network settings on two standard benchmarks including CIFAR-100 and ImageNet.
Notably, when trained together with our proposed method, ResNet-18 achieves
79.15% and 71.14% accuracy on CIFAR-100 and ImageNet, outperforming the
baseline results by 4.74% and 1.43%, respectively. The code is available at:
https://github.com/zhengli427/TESKD.
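To make the shared-backbone idea concrete, below is a minimal PyTorch-style sketch of a network whose intermediate stages feed auxiliary student heads that are distilled against the final teacher head, so the students' gradients flow back into the shared backbone. The module layout, number of student branches, temperature, and loss weighting are illustrative assumptions, not the authors' implementation (see the linked repository for that).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedBackboneSelfKD(nn.Module):
        """Sketch: a teacher with hierarchical student branches on a shared backbone."""

        def __init__(self, num_classes=100, width=64, temperature=4.0):
            super().__init__()
            self.temperature = temperature
            # Three stages stand in for the backbone (e.g. ResNet stages); an assumption.
            self.stage1 = nn.Sequential(nn.Conv2d(3, width, 3, 2, 1),
                                        nn.BatchNorm2d(width), nn.ReLU())
            self.stage2 = nn.Sequential(nn.Conv2d(width, 2 * width, 3, 2, 1),
                                        nn.BatchNorm2d(2 * width), nn.ReLU())
            self.stage3 = nn.Sequential(nn.Conv2d(2 * width, 4 * width, 3, 2, 1),
                                        nn.BatchNorm2d(4 * width), nn.ReLU())
            self.teacher_head = nn.Linear(4 * width, num_classes)
            # One lightweight student head per intermediate stage (hierarchical students).
            self.student_heads = nn.ModuleList([nn.Linear(width, num_classes),
                                                nn.Linear(2 * width, num_classes)])

        def forward(self, x):
            f1 = self.stage1(x)
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)
            pooled = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in (f1, f2, f3)]
            teacher_logits = self.teacher_head(pooled[2])
            student_logits = [head(p) for head, p in zip(self.student_heads, pooled[:2])]
            return teacher_logits, student_logits

        def loss(self, teacher_logits, student_logits, targets, kd_weight=1.0):
            T = self.temperature
            total = F.cross_entropy(teacher_logits, targets)
            for s in student_logits:
                # Each student is supervised by the labels and by the teacher head's
                # softened predictions (detached here; whether to detach is a design
                # choice). Its gradient also updates the shared backbone, which is the
                # "feedback" that lets the teacher improve itself.
                kd = F.kl_div(F.log_softmax(s / T, dim=1),
                              F.softmax(teacher_logits.detach() / T, dim=1),
                              reduction="batchmean") * T * T
                total = total + F.cross_entropy(s, targets) + kd_weight * kd
            return total

In a training loop one would call teacher_logits, student_logits = model(images) and back-propagate model.loss(teacher_logits, student_logits, labels); at deployment only the backbone and teacher head would be kept.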
Related papers
- Adaptive Teaching with Shared Classifier for Knowledge Distillation [6.03477652126575]
Knowledge distillation (KD) is a technique used to transfer knowledge from a teacher network to a student network.
We propose adaptive teaching with a shared classifier (ATSC).
Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multi-teacher scenarios.
arXiv Detail & Related papers (2024-06-12T08:51:08Z)
- Low-complexity deep learning frameworks for acoustic scene classification using teacher-student scheme and multiple spectrograms [59.86658316440461]
The proposed system comprises two main phases: (Phase I) Training a teacher network; and (Phase II) training a student network using distilled knowledge from the teacher.
Our experiments on the DCASE 2023 Task 1 Development dataset fulfill the low-complexity requirement and achieve a best classification accuracy of 57.4%.
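As a point of reference for the two-phase recipe, here is a hedged sketch of Phase II with the standard softened-logit distillation loss; the networks, temperature, and weighting are placeholders, and the actual low-complexity DCASE system (spectrogram ensembling, compression, etc.) is not modeled here.

    import torch
    import torch.nn.functional as F

    def distill_step(student, teacher, x, y, optimizer, T=4.0, alpha=0.5):
        """One Phase II step: the pre-trained teacher is frozen, the student learns."""
        teacher.eval()
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        hard = F.cross_entropy(s_logits, y)            # supervised loss on labels
        soft = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                        F.softmax(t_logits / T, dim=1),
                        reduction="batchmean") * T * T  # distillation loss
        loss = alpha * hard + (1.0 - alpha) * soft
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()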
arXiv Detail & Related papers (2023-05-16T14:21:45Z)
- ORC: Network Group-based Knowledge Distillation using Online Role Change [3.735965959270874]
We propose an online role change strategy for multiple teacher-based knowledge distillations.
The top-ranked networks in the student group are promoted to the teacher group at every iteration.
We verify the superiority of the proposed method on CIFAR-10, CIFAR-100, and ImageNet.
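A toy sketch of the role-change step under two assumptions of ours: networks are ranked by an externally supplied accuracy estimate, and the resulting teacher group guides the students via an averaged softened prediction; ORC's exact losses and group sizes may differ.

    import torch
    import torch.nn.functional as F

    def reassign_roles(networks, scores, num_teachers=2):
        # Promote the top-ranked networks to the teacher group for the next iteration.
        order = sorted(range(len(networks)), key=lambda i: scores[i], reverse=True)
        teachers = [networks[i] for i in order[:num_teachers]]
        students = [networks[i] for i in order[num_teachers:]]
        return teachers, students

    def teacher_group_targets(teachers, x, T=4.0):
        # Average the teacher group's softened predictions as the distillation target.
        with torch.no_grad():
            probs = [F.softmax(t(x) / T, dim=1) for t in teachers]
        return torch.stack(probs).mean(dim=0)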
arXiv Detail & Related papers (2022-06-01T10:28:18Z)
- Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For Model Compression [2.538209532048867]
Mutual Learning (ML) provides an alternative strategy where multiple simple student networks benefit from sharing knowledge.
We propose a single-teacher, multi-student framework that leverages both KD and ML to achieve better performance.
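Below is a minimal sketch of the mutual-learning component: each peer student adds a KL term toward every other peer's softened prediction on top of its supervised loss. Adding one more KL term per student toward the single teacher would give the combined KD+ML setup; the temperature and the detaching of peer logits are our assumptions.

    import torch.nn.functional as F

    def mutual_learning_losses(peer_logits, targets, T=1.0):
        """Returns one loss per peer student; each is optimized by its own optimizer."""
        losses = []
        for i, s_i in enumerate(peer_logits):
            loss = F.cross_entropy(s_i, targets)
            for j, s_j in enumerate(peer_logits):
                if i == j:
                    continue
                # KL toward each peer's softened prediction (peer treated as fixed).
                loss = loss + F.kl_div(F.log_softmax(s_i / T, dim=1),
                                       F.softmax(s_j.detach() / T, dim=1),
                                       reduction="batchmean") * T * T
            losses.append(loss)
        return losses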
arXiv Detail & Related papers (2021-10-21T09:59:31Z)
- Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z)
- Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the role of the connection paths across levels between teacher and student networks and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our final nested and compact framework requires negligible overhead and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Densely Guided Knowledge Distillation using Multiple Teacher Assistants [5.169724825219126]
We propose a densely guided knowledge distillation using multiple teacher assistants that gradually decreases the model size.
We also design a stochastic teaching strategy in which, for each mini-batch, the teacher or some of the teacher assistants are randomly dropped.
This acts as a regularizer and improves the efficiency of teaching the student network.
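A hedged sketch of that stochastic teaching step: for each mini-batch some guides (the teacher and its assistants) are randomly dropped and the student distills from the average of whichever remain; the drop probability, averaging rule, and temperature are illustrative assumptions.

    import random
    import torch
    import torch.nn.functional as F

    def densely_guided_kd_loss(student_logits, guides, x, targets, T=4.0, p_drop=0.5):
        # guides = [teacher, assistant_1, ..., assistant_n]
        loss = F.cross_entropy(student_logits, targets)
        with torch.no_grad():
            kept = [g for g in guides if random.random() > p_drop]
            if not kept:                       # always keep at least one guide
                kept = [random.choice(guides)]
            soft = torch.stack([F.softmax(g(x) / T, dim=1) for g in kept]).mean(dim=0)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1), soft,
                      reduction="batchmean") * T * T
        return loss + kd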
arXiv Detail & Related papers (2020-09-18T13:12:52Z)
- Point Adversarial Self Mining: A Simple Method for Facial Expression Recognition [79.75964372862279]
We propose Point Adversarial Self Mining (PASM) to improve the recognition accuracy in facial expression recognition.
PASM uses a point adversarial attack method and a trained teacher network to locate the most informative position related to the target task.
The adaptive generation of learning materials and the teacher/student updates can be repeated multiple times, iteratively improving the network's capability.
arXiv Detail & Related papers (2020-08-26T06:39:24Z)
- Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z)
- Teacher-Class Network: A Neural Network Compression Mechanism [2.257416403770908]
Instead of transferring knowledge to one student only, the proposed method transfers a chunk of knowledge to each student.
Our students are not trained on problem-specific logits; instead, they are trained to mimic the knowledge (dense representation) learned by the teacher network.
The proposed teacher-class architecture is evaluated on several benchmark datasets such as MNIST, Fashion MNIST, IMDB Movie Reviews, CAMVid, CIFAR-10 and ImageNet.
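A hedged sketch of the chunking idea: the teacher's dense representation is split into equal slices and each student regresses one slice (the equal split and MSE loss are our assumptions); concatenating the students' outputs at inference approximates the teacher's representation.

    import torch
    import torch.nn.functional as F

    def teacher_class_losses(students, teacher, x):
        with torch.no_grad():
            dense = teacher(x)                               # teacher's dense representation
        chunks = torch.chunk(dense, len(students), dim=1)    # one chunk per student
        return [F.mse_loss(s(x), c) for s, c in zip(students, chunks)]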
arXiv Detail & Related papers (2020-04-07T11:31:20Z)