Distilling Knowledge via Intermediate Classifier Heads
- URL: http://arxiv.org/abs/2103.00497v1
- Date: Sun, 28 Feb 2021 12:52:52 GMT
- Title: Distilling Knowledge via Intermediate Classifier Heads
- Authors: Aryan Asadian, Amirali Salehi-Abari
- Abstract summary: Knowledge distillation is a transfer-learning approach that trains a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
- Score: 0.5584060970507505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The crux of knowledge distillation -- as a transfer-learning approach -- is
to effectively train a resource-limited student model with the guide of a
pre-trained larger teacher model. However, when there is a large difference
between the model complexities of teacher and student (i.e., capacity gap),
knowledge distillation loses its strength in transferring knowledge from the
teacher to the student, thus training a weaker student. To mitigate the impact
of the capacity gap, we introduce knowledge distillation via intermediate
heads. By extending the intermediate layers of the teacher (at various depths)
with classifier heads, we cheaply acquire a cohort of heterogeneous pre-trained
teachers. The intermediate classifier heads can all together be efficiently
learned while freezing the backbone of the pre-trained teacher. The cohort of
teachers (including the original teacher) co-teach the student simultaneously.
Our experiments on various teacher-student pairs and datasets have demonstrated
that the proposed approach outperforms the canonical knowledge distillation
approach and its extensions.
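The co-teaching step described in the abstract can be sketched as a loss function. This is only a minimal sketch under stated assumptions: the abstract does not give the loss, so the temperature-scaled KL term, the plain sum over heads, and the names `cohort_distillation_loss`, `temperature`, and `alpha` are illustrative choices rather than the authors' formulation. It uses the Python standard library only and works on the logits of a single example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(logits, label):
    """Standard cross-entropy of the logits against an integer class label."""
    return -math.log(softmax(logits)[label])


def cohort_distillation_loss(student_logits, cohort_logits, label,
                             temperature=4.0, alpha=0.1):
    """Hypothetical loss for a student taught by a cohort of teacher heads.

    cohort_logits holds one logit vector per teacher in the cohort: the
    intermediate classifier heads plus the original teacher's final output.
    The weighting (alpha) and the plain sum over heads are assumptions.
    """
    hard = cross_entropy(student_logits, label)
    student_soft = softmax(student_logits, temperature)
    soft = sum(
        kl_divergence(softmax(head, temperature), student_soft)
        for head in cohort_logits
    )
    # The temperature**2 factor is the usual rescaling used in
    # temperature-based distillation losses.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


# Example: two intermediate heads plus the final teacher output, 3 classes.
cohort = [[1.0, 0.5, 0.2], [2.0, 0.3, 0.1], [4.0, 0.2, -1.0]]
loss = cohort_distillation_loss([1.5, 0.4, 0.0], cohort, label=0)
```

In this sketch every cohort member contributes its own soft-target term, so shallower heads supply softer guidance next to the full teacher. Training the heads themselves on a frozen backbone is a separate step that is not shown here.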
Related papers
- Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation [11.754014876977422]
This paper introduces a novel student-oriented perspective that refines the teacher's knowledge to better align with the student's needs.
We present the Student-Oriented Knowledge Distillation (SoKD), which incorporates a learnable feature augmentation strategy during training.
We also deploy the Distinctive Area Detection Module (DAM) to identify areas of mutual interest between the teacher and student.
arXiv Detail & Related papers (2024-09-27T14:34:08Z)
- Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z)
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Unified and Effective Ensemble Knowledge Distillation [92.67156911466397]
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods learn and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data.
arXiv Detail & Related papers (2022-04-01T16:15:39Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Student Network Learning via Evolutionary Knowledge Distillation [22.030934154498205]
We propose an evolutionary knowledge distillation approach to improve the transfer effectiveness of teacher knowledge.
Instead of a fixed pre-trained teacher, an evolutionary teacher is learned online and consistently transfers intermediate knowledge to supervise student network learning on-the-fly.
In this way, the student can simultaneously obtain rich internal knowledge and capture its growth process, leading to effective student network learning.
arXiv Detail & Related papers (2021-03-23T02:07:15Z)
- Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.