Densely Guided Knowledge Distillation using Multiple Teacher Assistants
- URL: http://arxiv.org/abs/2009.08825v3
- Date: Mon, 9 Aug 2021 05:48:48 GMT
- Title: Densely Guided Knowledge Distillation using Multiple Teacher Assistants
- Authors: Wonchul Son and Jaemin Na and Junyong Choi and Wonjun Hwang
- Abstract summary: We propose a densely guided knowledge distillation using multiple teacher assistants that gradually decreases the model size.
We also design stochastic teaching where, for each mini-batch, the teacher or some teacher assistants are randomly dropped.
This acts as a regularizer that improves the efficiency of teaching the student network.
- Score: 5.169724825219126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the success of deep neural networks, knowledge distillation which guides
the learning of a small student network from a large teacher network is being
actively studied for model compression and transfer learning. However, few
studies have been performed to resolve the poor learning issue of the student
network when the student and teacher model sizes significantly differ. In this
paper, we propose a densely guided knowledge distillation using multiple
teacher assistants that gradually decreases the model size to efficiently
bridge the large gap between the teacher and student networks. To stimulate
more efficient learning of the student network, we guide each teacher assistant
to every other smaller teacher assistant iteratively. Specifically, when
teaching a smaller teacher assistant at the next step, the existing larger
teacher assistants from the previous step are used as well as the teacher
network. Moreover, we design stochastic teaching where, for each mini-batch, the
teacher or some teacher assistants are randomly dropped. This acts as a regularizer
that improves the efficiency of teaching the student network. Thus, the student
can always learn salient distilled knowledge from the multiple sources. We
verified the effectiveness of the proposed method for a classification task
using CIFAR-10, CIFAR-100, and ImageNet. We also achieved significant
performance improvements with various backbone architectures such as ResNet,
WideResNet, and VGG.
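As a rough illustration of the objective described in the abstract, here is a minimal PyTorch-style sketch (not the authors' released code; the function and hyperparameter names `densely_guided_kd_loss`, `T`, `alpha`, and `drop_p` are hypothetical). It distills a smaller network from the teacher together with all larger teacher assistants and, per mini-batch, randomly drops some of those guides as the stochastic-teaching regularizer:

```python
import random
import torch.nn.functional as F

def densely_guided_kd_loss(student_logits, guide_logits_list, labels,
                           T=4.0, alpha=0.7, drop_p=0.3):
    """Sketch of densely guided KD with stochastic teaching (assumed form).

    student_logits:    logits of the network being trained (student or a smaller TA)
    guide_logits_list: logits from the teacher and every larger teacher assistant
    drop_p:            probability of dropping each guide for this mini-batch
    """
    # Stochastic teaching: randomly drop guides for this mini-batch,
    # keeping at least one so some distilled knowledge always remains.
    kept = [g for g in guide_logits_list if random.random() > drop_p]
    if not kept:
        kept = [random.choice(guide_logits_list)]

    # Hard-label cross-entropy term.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-target KL terms, averaged over the surviving guides.
    log_p = F.log_softmax(student_logits / T, dim=1)
    kd = 0.0
    for g in kept:
        q = F.softmax(g.detach() / T, dim=1)
        kd = kd + F.kl_div(log_p, q, reduction="batchmean") * (T * T)
    kd = kd / len(kept)

    return (1.0 - alpha) * ce + alpha * kd
```

In the cascade described in the abstract, the same objective would be applied at every step: the teacher assistant trained at one step joins the pool of guides for the next, smaller teacher assistant, until the final student is trained.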
Related papers
- Adaptive Teaching with Shared Classifier for Knowledge Distillation [6.03477652126575]
Knowledge distillation (KD) is a technique used to transfer knowledge from a teacher network to a student network.
We propose adaptive teaching with a shared classifier (ATSC).
Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multi-teacher scenarios.
arXiv Detail & Related papers (2024-06-12T08:51:08Z)
- Knowledge Distillation via Weighted Ensemble of Teaching Assistants [18.593268785143426]
Knowledge distillation is the process of transferring knowledge from a large model called the teacher to a smaller model called the student.
When the network size gap between the teacher and student increases, the performance of the student network decreases.
We show that, by using multiple teaching assistant models, the student model (the smaller model) can be further improved.
arXiv Detail & Related papers (2022-06-23T22:50:05Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- ORC: Network Group-based Knowledge Distillation using Online Role Change [3.735965959270874]
We propose an online role change strategy for multi-teacher knowledge distillation.
The top-ranked networks in the student group are promoted to the teacher group at every iteration.
We verify the superiority of the proposed method on CIFAR-10, CIFAR-100, and ImageNet.
arXiv Detail & Related papers (2022-06-01T10:28:18Z)
- Iterative Teacher-Aware Learning [136.05341445369265]
In human pedagogy, teachers and students can interact adaptively to maximize communication efficiency.
We propose a gradient-optimization-based teacher-aware learner that can incorporate the teacher's cooperative intention into the likelihood function.
arXiv Detail & Related papers (2021-10-01T00:27:47Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z)
- Point Adversarial Self Mining: A Simple Method for Facial Expression Recognition [79.75964372862279]
We propose Point Adversarial Self Mining (PASM) to improve the recognition accuracy in facial expression recognition.
PASM uses a point adversarial attack method and a trained teacher network to locate the most informative position related to the target task.
The adaptive learning material generation and teacher/student updates can be conducted more than once, improving the network's capability iteratively.
arXiv Detail & Related papers (2020-08-26T06:39:24Z)
- Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.