Knowledge Distillation from A Stronger Teacher
- URL: http://arxiv.org/abs/2205.10536v1
- Date: Sat, 21 May 2022 08:30:58 GMT
- Title: Knowledge Distillation from A Stronger Teacher
- Authors: Tao Huang, Shan You, Fei Wang, Chen Qian, Chang Xu
- Abstract summary: This paper presents a method dubbed DIST to distill better from a stronger teacher.
We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be fairly severe.
Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures.
- Score: 44.11781464210916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unlike existing knowledge distillation methods, which focus on baseline
settings where the teacher models and training strategies are not as strong and
competitive as state-of-the-art approaches, this paper presents a method
dubbed DIST to distill better from a stronger teacher. We empirically find that
the discrepancy between the predictions of the student and a stronger teacher
tends to be fairly severe. As a result, exactly matching the predictions with KL
divergence disturbs the training and makes existing methods perform poorly.
In this paper, we show that simply preserving the relations between the
predictions of teacher and student would suffice, and propose a
correlation-based loss to capture the intrinsic inter-class relations from the
teacher explicitly. Besides, considering that different instances have
different semantic similarities to each class, we also extend this relational
match to the intra-class level. Our method is simple yet practical, and
extensive experiments demonstrate that it adapts well to various architectures,
model sizes and training strategies, and can achieve state-of-the-art
performance consistently on image classification, object detection, and
semantic segmentation tasks. Code is available at:
https://github.com/hunto/DIST_KD .
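Below is a minimal PyTorch sketch in the spirit of the abstract's correlation-based loss: rather than exactly matching teacher and student probabilities with KL divergence, it only asks their predictions to be correlated, both across classes (inter-class, per sample) and across samples (intra-class, per class). The temperature and loss weights are illustrative assumptions; the repository linked above contains the reference implementation.

```python
# Correlation-based relational KD sketch (assumptions: tau, beta, gamma values).
import torch
import torch.nn.functional as F


def pearson_corr(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Row-wise Pearson correlation between two [N, D] tensors."""
    a = a - a.mean(dim=1, keepdim=True)
    b = b - b.mean(dim=1, keepdim=True)
    a = a / (a.norm(dim=1, keepdim=True) + eps)
    b = b / (b.norm(dim=1, keepdim=True) + eps)
    return (a * b).sum(dim=1)  # [N]


def dist_like_loss(student_logits, teacher_logits, tau: float = 4.0,
                   beta: float = 1.0, gamma: float = 1.0):
    """Inter-class + intra-class correlation loss on softened predictions."""
    p_s = F.softmax(student_logits / tau, dim=1)  # [B, C]
    p_t = F.softmax(teacher_logits / tau, dim=1)  # [B, C]
    inter = 1.0 - pearson_corr(p_s, p_t).mean()          # relations across classes
    intra = 1.0 - pearson_corr(p_s.t(), p_t.t()).mean()  # relations across samples
    return beta * inter + gamma * intra


if __name__ == "__main__":
    s = torch.randn(8, 100)  # student logits: batch of 8, 100 classes
    t = torch.randn(8, 100)  # teacher logits
    print(dist_like_loss(s, t).item())
```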
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
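A rough sketch of the interleaved sampling idea described above: the student proposes the next token, and the teacher overrides proposals that fall outside its own top-k, resampling from its distribution. The model interface (`next_token_logits`) and the top-k acceptance rule are illustrative assumptions, not the paper's exact criterion.

```python
import torch


@torch.no_grad()
def interleaved_sample(student, teacher, prompt_ids, max_new_tokens=32, top_k=20):
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        s_logits = student.next_token_logits(ids)   # [vocab]
        t_logits = teacher.next_token_logits(ids)   # [vocab]
        proposal = torch.multinomial(torch.softmax(s_logits, dim=-1), 1)
        # Accept the student's token only if the teacher ranks it in its top-k;
        # otherwise replace it with a sample from the teacher's distribution.
        teacher_topk = torch.topk(t_logits, top_k).indices
        if proposal.item() not in teacher_topk.tolist():
            proposal = torch.multinomial(torch.softmax(t_logits, dim=-1), 1)
        ids = torch.cat([ids, proposal], dim=0)
    return ids


if __name__ == "__main__":
    class DummyLM:
        """Stand-in language model: random logits over a toy vocabulary."""
        def __init__(self, vocab=100):
            self.vocab = vocab
        def next_token_logits(self, ids):
            return torch.randn(self.vocab)

    out = interleaved_sample(DummyLM(), DummyLM(), prompt_ids=torch.tensor([1, 2, 3]))
    print(out.tolist())
```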
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Cosine Similarity Knowledge Distillation for Individual Class Information Transfer [11.544799404018473]
We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher model's performance.
We use cosine similarity, a technique in Natural Language Processing (NLP) for measuring the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to improve the performance.
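A hedged sketch of the idea named above: the cosine similarity between student and teacher predictions sets a per-sample temperature before the usual KL-based distillation. The mapping from similarity to temperature (linear interpolation between t_min and t_max) is an assumption for illustration, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F


def cosine_weighted_temperature_kd(student_logits, teacher_logits,
                                   t_min=2.0, t_max=6.0):
    # Per-sample agreement between student and teacher predictions.
    sim = F.cosine_similarity(F.softmax(student_logits, dim=1),
                              F.softmax(teacher_logits, dim=1), dim=1)  # [B]
    # Higher agreement -> softer targets (larger temperature), per sample.
    tau = t_min + (t_max - t_min) * sim.clamp(0, 1).unsqueeze(1)        # [B, 1]
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (tau.mean() ** 2)


if __name__ == "__main__":
    print(cosine_weighted_temperature_kd(torch.randn(4, 10), torch.randn(4, 10)).item())
```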
arXiv Detail & Related papers (2023-11-24T06:34:47Z) - Contrastive Knowledge Amalgamation for Unsupervised Image Classification [2.6392087010521728]
Contrastive Knowledge Amalgamation (CKA) aims to learn a compact student model to handle the joint objective from multiple teacher models.
Intra- and inter-model contrastive losses are designed to widen the distance between representations of different classes.
The alignment loss is introduced to minimize the sample-level distribution differences of teacher-student models in the common representation space.
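A sketch of one plausible instantiation of the losses described above: teacher and student features are projected into a common space, an alignment term pulls each student sample toward the corresponding teacher sample, and a simple contrastive term pushes apart representations of different classes. The projection sizes and margin are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn


class CommonSpaceAligner(nn.Module):
    def __init__(self, student_dim, teacher_dim, common_dim=128, margin=0.5):
        super().__init__()
        self.proj_s = nn.Linear(student_dim, common_dim)
        self.proj_t = nn.Linear(teacher_dim, common_dim)
        self.margin = margin

    def forward(self, feat_s, feat_t, labels):
        z_s = F.normalize(self.proj_s(feat_s), dim=1)  # [B, D]
        z_t = F.normalize(self.proj_t(feat_t), dim=1)  # [B, D]
        # Alignment: sample-level match between student and teacher.
        align = F.mse_loss(z_s, z_t)
        # Contrastive: push apart student pairs with different labels.
        dist = torch.cdist(z_s, z_s)                       # [B, B]
        diff = (labels.unsqueeze(0) != labels.unsqueeze(1)).float()
        contrast = (diff * F.relu(self.margin - dist)).sum() / diff.sum().clamp(min=1)
        return align + contrast


if __name__ == "__main__":
    m = CommonSpaceAligner(64, 256)
    loss = m(torch.randn(8, 64), torch.randn(8, 256), torch.randint(0, 5, (8,)))
    print(loss.item())
```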
arXiv Detail & Related papers (2023-07-27T11:21:14Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
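A hedged sketch of distilling relative query-document geometry: the student is trained so that its query-document score distribution over a candidate list matches the teacher's. Using a softmax over in-batch documents and a KL objective is an assumption for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def geometry_distill_loss(q_s, d_s, q_t, d_t, tau=1.0):
    """q_*, d_*: [B, D] query/document embeddings from student (s) and teacher (t)."""
    scores_s = q_s @ d_s.t() / tau          # [B, B] student query-doc scores
    scores_t = q_t @ d_t.t() / tau          # [B, B] teacher query-doc scores
    return F.kl_div(F.log_softmax(scores_s, dim=1),
                    F.softmax(scores_t, dim=1), reduction="batchmean")


if __name__ == "__main__":
    B, Ds, Dt = 16, 128, 768  # student/teacher embedding sizes may differ
    loss = geometry_distill_loss(torch.randn(B, Ds), torch.randn(B, Ds),
                                 torch.randn(B, Dt), torch.randn(B, Dt))
    print(loss.item())
```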
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of enforcing the teacher to work on the same task as the student, we borrow the knowledge from a teacher trained from a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z) - Contrastive Learning for Fair Representations [50.95604482330149]
Trained classification models can unintentionally lead to biased representations and predictions.
Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and difficult to optimise.
We propose a method for mitigating bias by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations.
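A minimal sketch of the contrastive idea described above: within a batch, representations of instances that share a class label are pulled together relative to all other instances, in a supervised-contrastive style. The temperature and normalization choices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def label_contrastive_loss(features, labels, tau=0.1):
    """features: [B, D] encoder outputs; labels: [B] class ids."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / tau                                   # [B, B]
    mask_self = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(mask_self, float("-inf"))         # exclude self-pairs
    log_prob = F.log_softmax(sim, dim=1)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    # Average log-probability of positive pairs for each anchor that has one.
    pos_counts = pos.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos, 0.0).sum(dim=1) / pos_counts)
    return loss[pos.sum(dim=1) > 0].mean()


if __name__ == "__main__":
    print(label_contrastive_loss(torch.randn(16, 32), torch.randint(0, 3, (16,))).item())
```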
arXiv Detail & Related papers (2021-09-22T10:47:51Z) - Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
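A sketch of the auxiliary-classifier idea described above: small classifier heads are attached to intermediate student features, and each head's output is matched to the teacher's prediction with a soft KL term. Head placement, pooling, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn


class AuxHeadKD(nn.Module):
    def __init__(self, feat_dims, num_classes, tau=4.0):
        super().__init__()
        # One lightweight classifier per tapped intermediate feature map.
        self.heads = nn.ModuleList(nn.Linear(d, num_classes) for d in feat_dims)
        self.tau = tau

    def forward(self, intermediate_feats, teacher_logits):
        """intermediate_feats: list of [B, C_i, H, W] tensors from the student."""
        p_t = F.softmax(teacher_logits / self.tau, dim=1)
        loss = 0.0
        for head, feat in zip(self.heads, intermediate_feats):
            pooled = F.adaptive_avg_pool2d(feat, 1).flatten(1)   # [B, C_i]
            log_p_s = F.log_softmax(head(pooled) / self.tau, dim=1)
            loss = loss + F.kl_div(log_p_s, p_t, reduction="batchmean")
        return loss * (self.tau ** 2) / len(self.heads)


if __name__ == "__main__":
    kd = AuxHeadKD(feat_dims=[64, 128], num_classes=10)
    feats = [torch.randn(4, 64, 16, 16), torch.randn(4, 128, 8, 8)]
    print(kd(feats, torch.randn(4, 10)).item())
```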
arXiv Detail & Related papers (2020-12-05T00:49:14Z) - Feature Distillation With Guided Adversarial Contrastive Learning [41.28710294669751]
We propose Guided Adversarial Contrastive Distillation (GACD) to transfer adversarial robustness from teacher to student with features.
With a well-trained teacher model as an anchor, students are expected to extract features similar to the teacher.
With GACD, the student not only learns to extract robust features, but also captures structural knowledge from the teacher.
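A hedged sketch of the anchor idea described above: the student's feature for each input is pulled toward the teacher's feature of the same image and pushed away from teacher features of other images, via an InfoNCE-style objective. Adversarial example generation is assumed to happen elsewhere; the temperature is illustrative.

```python
import torch
import torch.nn.functional as F


def teacher_anchored_contrastive(student_feats, teacher_feats, tau=0.2):
    """student_feats, teacher_feats: [B, D] features for the same batch of inputs."""
    z_s = F.normalize(student_feats, dim=1)
    z_t = F.normalize(teacher_feats, dim=1)
    logits = z_s @ z_t.t() / tau              # [B, B]; diagonal = matching pairs
    targets = torch.arange(len(z_s))
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    print(teacher_anchored_contrastive(torch.randn(8, 256), torch.randn(8, 256)).item())
```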
arXiv Detail & Related papers (2020-09-21T14:46:17Z)