Knowledge Distillation with the Reused Teacher Classifier
- URL: http://arxiv.org/abs/2203.14001v1
- Date: Sat, 26 Mar 2022 06:28:46 GMT
- Title: Knowledge Distillation with the Reused Teacher Classifier
- Authors: Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, Chun Chen
- Abstract summary: We show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap.
Our technique achieves state-of-the-art results at a modest cost in compression ratio due to the added projector.
- Score: 31.22117343316628
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Knowledge distillation aims to compress a powerful yet cumbersome teacher
model into a lightweight student model without much sacrifice of performance.
For this purpose, various approaches have been proposed over the past few
years, generally with elaborately designed knowledge representations, which in
turn increase the difficulty of model development and interpretation. In
contrast, we empirically show that a simple knowledge distillation technique is
enough to significantly narrow down the teacher-student performance gap. We
directly reuse the discriminative classifier from the pre-trained teacher model
for student inference and train a student encoder through feature alignment
with a single $\ell_2$ loss. In this way, the student model is able to achieve
exactly the same performance as the teacher model provided that their extracted
features are perfectly aligned. An additional projector is developed to help
the student encoder match the teacher classifier, which renders our
technique applicable to various teacher and student architectures. Extensive
experiments demonstrate that our technique achieves state-of-the-art results at
a modest cost in compression ratio due to the added projector.
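To make the recipe in the abstract concrete, here is a minimal PyTorch-style sketch, assuming generic `student_encoder`, `teacher_encoder`, and `teacher_classifier` modules and a simple two-layer MLP projector; these names, dimensions, and the projector design are illustrative assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of the reused-teacher-classifier idea: train a student encoder
# plus projector with a single L2 feature-alignment loss, then reuse the frozen
# teacher classifier at inference. Module names and shapes are assumptions.
import torch
import torch.nn as nn

class ReusedClassifierKD(nn.Module):
    def __init__(self, student_encoder: nn.Module, teacher_encoder: nn.Module,
                 teacher_classifier: nn.Module, s_dim: int, t_dim: int):
        super().__init__()
        self.student_encoder = student_encoder        # trainable
        self.teacher_encoder = teacher_encoder        # frozen, training only
        self.teacher_classifier = teacher_classifier  # frozen, reused at inference
        # Projector maps student features into the teacher feature space so the
        # reused teacher classifier can consume them (simple MLP; an assumption).
        self.projector = nn.Sequential(
            nn.Linear(s_dim, t_dim), nn.ReLU(inplace=True), nn.Linear(t_dim, t_dim)
        )
        for p in list(self.teacher_encoder.parameters()) + \
                 list(self.teacher_classifier.parameters()):
            p.requires_grad = False

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        # Single L2 feature-alignment loss between projected student features
        # and teacher features; no label or KL term in this sketch.
        with torch.no_grad():
            f_t = self.teacher_encoder(x)
        f_s = self.projector(self.student_encoder(x))
        return torch.mean((f_s - f_t) ** 2)

    @torch.no_grad()
    def predict(self, x: torch.Tensor) -> torch.Tensor:
        # Inference: student encoder + projector, followed by the reused teacher classifier.
        return self.teacher_classifier(self.projector(self.student_encoder(x)))
```

At inference only the student encoder, the projector, and the reused classifier are kept, which is consistent with the abstract's note that the added projector slightly reduces the compression ratio.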
Related papers
- Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures [4.960025399247103]
Generic Teacher Network (GTN) is a one-off, KD-aware training procedure that creates a generic teacher capable of effectively transferring knowledge to any student model sampled from a finite pool of architectures.
Our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across students in the pool.
arXiv Detail & Related papers (2024-07-22T20:34:00Z)
- Understanding the Effects of Projectors in Knowledge Distillation [31.882356225974632]
Even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance.
This paper investigates the implicit role that projectors play, which has so far been overlooked.
Motivated by the positive effects of projectors, we propose a projector ensemble-based feature distillation method to further improve distillation performance (a minimal sketch of this idea follows the related-papers list below).
arXiv Detail & Related papers (2023-10-26T06:30:39Z)
- Can a student Large Language Model perform as well as it's teacher? [0.0]
Knowledge distillation aims to transfer knowledge from a high-capacity "teacher" model to a streamlined "student" model.
This paper provides a comprehensive overview of the knowledge distillation paradigm.
arXiv Detail & Related papers (2023-10-03T20:34:59Z)
- Student-friendly Knowledge Distillation [1.5469452301122173]
We propose student-friendly knowledge distillation (SKD) to simplify teacher output into new knowledge representations.
SKD consists of a softening processing step and a learning simplifier.
The experimental results on the CIFAR-100 and ImageNet datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-05-18T11:44:30Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of forcing the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z)
- Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
In contrast to most existing methods, which rely on effective training of student models given pretrained teachers, we aim to learn teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z)
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
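As a companion to the projector-ensemble entry above ("Understanding the Effects of Projectors in Knowledge Distillation"), the following is a minimal sketch of one way an ensemble-based feature-distillation loss could look; the ensemble size, projector architecture, and the choice to average projected features before an L2 alignment loss are assumptions for illustration, not that paper's exact method.

```python
# Hedged sketch of a projector-ensemble feature-distillation loss.
# Ensemble size, projector design, and the averaging scheme are assumptions.
import torch
import torch.nn as nn

class ProjectorEnsembleLoss(nn.Module):
    def __init__(self, s_dim: int, t_dim: int, num_projectors: int = 3):
        super().__init__()
        # Several independent projectors map student features to the teacher space.
        self.projectors = nn.ModuleList([
            nn.Sequential(nn.Linear(s_dim, t_dim), nn.ReLU(inplace=True),
                          nn.Linear(t_dim, t_dim))
            for _ in range(num_projectors)
        ])

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Average the projected student features over the ensemble, then align
        # the ensemble mean to the (detached) teacher features with an L2 loss.
        projected = torch.stack([proj(f_student) for proj in self.projectors], dim=0)
        f_ensemble = projected.mean(dim=0)
        return torch.mean((f_ensemble - f_teacher.detach()) ** 2)
```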