Knowledge Distillation with the Reused Teacher Classifier
- URL: http://arxiv.org/abs/2203.14001v1
- Date: Sat, 26 Mar 2022 06:28:46 GMT
- Title: Knowledge Distillation with the Reused Teacher Classifier
- Authors: Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, Chun Chen
- Abstract summary: We show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap.
Our technique achieves state-of-the-art results at a modest cost in compression ratio due to the added projector.
- Score: 31.22117343316628
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Knowledge distillation aims to compress a powerful yet cumbersome teacher
model into a lightweight student model without much sacrifice of performance.
For this purpose, various approaches have been proposed over the past few
years, generally with elaborately designed knowledge representations, which in
turn increase the difficulty of model development and interpretation. In
contrast, we empirically show that a simple knowledge distillation technique is
enough to significantly narrow down the teacher-student performance gap. We
directly reuse the discriminative classifier from the pre-trained teacher model
for student inference and train a student encoder through feature alignment
with a single $\ell_2$ loss. In this way, the student model is able to achieve
exactly the same performance as the teacher model provided that their extracted
features are perfectly aligned. An additional projector is developed to help
the student encoder match the teacher classifier, which renders our
technique applicable to various teacher and student architectures. Extensive
experiments demonstrate that our technique achieves state-of-the-art results at
a modest cost in compression ratio due to the added projector.
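To make the recipe in the abstract concrete, here is a minimal PyTorch-style sketch, assuming generic `student_encoder`, `teacher_encoder`, and `teacher_classifier` modules and a simple two-layer MLP projector; these names, dimensions, and the projector design are illustrative assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of the reused-teacher-classifier idea: train a student encoder
# plus projector with a single L2 feature-alignment loss, then reuse the frozen
# teacher classifier at inference. Module names and shapes are assumptions.
import torch
import torch.nn as nn

class ReusedClassifierKD(nn.Module):
    def __init__(self, student_encoder: nn.Module, teacher_encoder: nn.Module,
                 teacher_classifier: nn.Module, s_dim: int, t_dim: int):
        super().__init__()
        self.student_encoder = student_encoder        # trainable
        self.teacher_encoder = teacher_encoder        # frozen, training only
        self.teacher_classifier = teacher_classifier  # frozen, reused at inference
        # Projector maps student features into the teacher feature space so the
        # reused teacher classifier can consume them (simple MLP; an assumption).
        self.projector = nn.Sequential(
            nn.Linear(s_dim, t_dim), nn.ReLU(inplace=True), nn.Linear(t_dim, t_dim)
        )
        for p in list(self.teacher_encoder.parameters()) + \
                 list(self.teacher_classifier.parameters()):
            p.requires_grad = False

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        # Single L2 feature-alignment loss between projected student features
        # and teacher features; no label or KL term in this sketch.
        with torch.no_grad():
            f_t = self.teacher_encoder(x)
        f_s = self.projector(self.student_encoder(x))
        return torch.mean((f_s - f_t) ** 2)

    @torch.no_grad()
    def predict(self, x: torch.Tensor) -> torch.Tensor:
        # Inference: student encoder + projector, followed by the reused teacher classifier.
        return self.teacher_classifier(self.projector(self.student_encoder(x)))
```

At inference only the student encoder, the projector, and the reused classifier are kept, which is consistent with the abstract's note that the added projector slightly reduces the compression ratio.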
Related papers
- Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures [4.960025399247103]
Generic Teacher Network (GTN) is a one-off, KD-aware training procedure that creates a generic teacher capable of effectively transferring knowledge to any student model sampled from a finite pool of architectures.
Our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across students in the pool.
arXiv Detail & Related papers (2024-07-22T20:34:00Z)
- Understanding the Effects of Projectors in Knowledge Distillation [31.882356225974632]
Even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance.
This paper investigates the implicit role that projectors play, which has so far been overlooked.
Motivated by the positive effects of projectors, we propose a projector ensemble-based feature distillation method to further improve distillation performance (a minimal sketch of this idea follows the related-papers list below).
arXiv Detail & Related papers (2023-10-26T06:30:39Z)
- Can a student Large Language Model perform as well as it's teacher? [0.0]
Knowledge distillation aims to transfer knowledge from a high-capacity "teacher" model to a streamlined "student" model.
This paper provides a comprehensive overview of the knowledge distillation paradigm.
arXiv Detail & Related papers (2023-10-03T20:34:59Z)
- Student-friendly Knowledge Distillation [1.5469452301122173]
We propose student-friendly knowledge distillation (SKD) to simplify teacher output into new knowledge representations.
SKD consists of a softening processing step and a learning simplifier.
The experimental results on the CIFAR-100 and ImageNet datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-05-18T11:44:30Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of forcing the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z)
- Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
In contrast to most existing methods, which rely on effective training of student models given pretrained teachers, we aim to learn teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z)
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
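As a companion to the projector-ensemble entry above ("Understanding the Effects of Projectors in Knowledge Distillation"), the following is a minimal sketch of one way an ensemble-based feature-distillation loss could look; the ensemble size, projector architecture, and the choice to average projected features before an L2 alignment loss are assumptions for illustration, not that paper's exact method.

```python
# Hedged sketch of a projector-ensemble feature-distillation loss.
# Ensemble size, projector design, and the averaging scheme are assumptions.
import torch
import torch.nn as nn

class ProjectorEnsembleLoss(nn.Module):
    def __init__(self, s_dim: int, t_dim: int, num_projectors: int = 3):
        super().__init__()
        # Several independent projectors map student features to the teacher space.
        self.projectors = nn.ModuleList([
            nn.Sequential(nn.Linear(s_dim, t_dim), nn.ReLU(inplace=True),
                          nn.Linear(t_dim, t_dim))
            for _ in range(num_projectors)
        ])

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Average the projected student features over the ensemble, then align
        # the ensemble mean to the (detached) teacher features with an L2 loss.
        projected = torch.stack([proj(f_student) for proj in self.projectors], dim=0)
        f_ensemble = projected.mean(dim=0)
        return torch.mean((f_ensemble - f_teacher.detach()) ** 2)
```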