Generalized Knowledge Distillation via Relationship Matching
- URL: http://arxiv.org/abs/2205.01915v1
- Date: Wed, 4 May 2022 06:49:47 GMT
- Title: Generalized Knowledge Distillation via Relationship Matching
- Authors: Han-Jia Ye, Su Lu, De-Chuan Zhan
- Abstract summary: Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of forcing the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
- Score: 53.69235109551099
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The knowledge of a well-trained deep neural network (a.k.a. the "teacher") is
valuable for learning similar tasks. Knowledge distillation extracts knowledge
from the teacher and integrates it with the target model (a.k.a. the
"student"), which expands the student's knowledge and improves its learning
efficacy. Instead of forcing the teacher to work on the same task as the
student, we borrow knowledge from a teacher trained on a general label
space -- in this "Generalized Knowledge Distillation (GKD)" setting, the classes of the
teacher and the student may be the same, completely different, or partially
overlapped. We claim that the comparison ability between instances acts as an
essential factor threading knowledge across tasks, and propose the RElationship
FacIlitated Local cLassifiEr Distillation (REFILLED) approach, which decouples
the GKD flow of the embedding and the top-layer classifier. In particular,
rather than reconciling the instance-label confidence between models,
REFILLED requires the teacher to reweight the hard tuples put forward by the
student and then matches the similarity comparisons between instances. An
embedding-induced classifier based on the teacher model supervises the
student's classification confidence and adaptively emphasizes the most related
supervision from the teacher. REFILLED demonstrates strong discriminative
ability as the teacher's classes range from identical to fully non-overlapping
with respect to the student's. It also achieves state-of-the-art
performance on standard knowledge distillation, one-step incremental learning,
and few-shot learning tasks.
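As a rough illustration of the two decoupled stages the abstract describes, the sketch below (PyTorch-style, with hypothetical tensor names and hyper-parameters, not the authors' released code) matches teacher and student instance-similarity comparisons for the embedding, and lets an embedding-induced nearest-class-mean classifier built from teacher features supervise the student's class confidence. The explicit re-weighting of student-mined hard tuples is omitted for brevity.
```python
# Minimal sketch, assuming PyTorch; names and hyper-parameters are illustrative.
import torch
import torch.nn.functional as F


def relationship_matching_loss(student_emb, teacher_emb, tau=4.0):
    """Embedding stage: the student matches the teacher's instance-similarity comparisons.

    student_emb: (B, d_s) student embeddings of a mini-batch.
    teacher_emb: (B, d_t) teacher embeddings of the same mini-batch.
    """
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1)
    s_sim, t_sim = s @ s.t(), t @ t.t()

    # Each row becomes a comparison distribution over the *other* instances.
    b = s_sim.size(0)
    off_diag = ~torch.eye(b, dtype=torch.bool, device=s_sim.device)
    s_logits = s_sim[off_diag].view(b, b - 1) / tau
    t_logits = t_sim[off_diag].view(b, b - 1) / tau

    # Teacher comparisons supervise student comparisons (hard-tuple re-weighting omitted).
    return F.kl_div(F.log_softmax(s_logits, dim=1),
                    F.softmax(t_logits, dim=1), reduction="batchmean")


def embedding_induced_classifier_loss(student_logits, teacher_emb, class_protos, tau=4.0):
    """Classifier stage: a nearest-class-mean classifier in the teacher's embedding space
    (class_protos: (C, d_t) class means precomputed on the student's training data)
    provides soft targets for the student's classification confidence."""
    sims = F.normalize(teacher_emb, dim=1) @ F.normalize(class_protos, dim=1).t()
    soft_targets = F.softmax(sims / tau, dim=1)
    return F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                    soft_targets, reduction="batchmean")
```
In a training loop, both terms would typically be added to the standard cross-entropy on the student's own labels, with the temperature tau and the loss weights tuned per task.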
Related papers
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that the accuracy of the student can be further boosted by using a larger teacher.
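One plausible reading of integrating part of the teacher's features as prior knowledge is sketched below; the masking scheme and the assumed 1x1 projection layer are illustrative, not the paper's implementation. A random fraction of the student's feature map is filled with the teacher's features before the usual feature-mimicking loss, so only the remaining positions have to be reconstructed by the student.
```python
# Hypothetical sketch (PyTorch); the masking scheme and 1x1 projection are assumptions.
import torch
import torch.nn.functional as F


def prior_feature_distillation_loss(student_feat, teacher_feat, proj, prior_ratio=0.5):
    """student_feat: (B, C_s, H, W); teacher_feat: (B, C_t, H, W);
    proj: a 1x1 conv (nn.Conv2d(C_s, C_t, 1)) aligning the channel dimensions."""
    student_feat = proj(student_feat)
    b, _, h, w = student_feat.shape

    # Mark a random fraction of spatial positions as "prior" positions.
    prior_mask = (torch.rand(b, 1, h, w, device=student_feat.device) < prior_ratio).float()

    # Prior positions are filled with the teacher's features; the rest stay the student's.
    mixed = prior_mask * teacher_feat + (1.0 - prior_mask) * student_feat

    # Feature mimicking on the mixed map: the loss is zero at prior positions,
    # so only the remaining part has to be reconstructed by the student.
    return F.mse_loss(mixed, teacher_feat)
```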
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Knowledge Distillation from A Stronger Teacher [44.11781464210916]
This paper presents a method dubbed DIST to distill better from a stronger teacher.
We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be considerably more severe.
Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures.
arXiv Detail & Related papers (2022-05-21T08:30:58Z)
- Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student's.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach for training a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
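A minimal sketch of the intermediate-head idea, under the assumption that light-weight classifier heads are mounted on chosen intermediate teacher feature maps and the student is softened toward all of them; module and parameter names are illustrative.
```python
# Minimal sketch (PyTorch); module names and the head design are assumptions.
import torch.nn as nn
import torch.nn.functional as F


class IntermediateHeads(nn.Module):
    """Light-weight classifier heads mounted on chosen intermediate teacher feature maps."""

    def __init__(self, feat_dims, num_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(d, num_classes))
            for d in feat_dims
        )

    def forward(self, intermediate_feats):
        # One set of logits per tapped teacher layer.
        return [head(f) for head, f in zip(self.heads, intermediate_feats)]


def multi_head_kd_loss(student_logits, head_logits_list, teacher_logits, tau=4.0):
    """The student matches the softened output of every intermediate head plus the
    teacher's final head, giving it easier, lower-capacity targets to imitate."""
    targets = head_logits_list + [teacher_logits]
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    losses = [F.kl_div(log_p_s, F.softmax(t / tau, dim=1), reduction="batchmean")
              for t in targets]
    return (tau ** 2) * sum(losses) / len(losses)
```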
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
- Multi-View Feature Representation for Dialogue Generation with Bidirectional Distillation [22.14228918338769]
We propose a novel training framework, where the learning of general knowledge is more in line with the idea of reaching consensus.
Our framework effectively improves the model generalization without sacrificing training efficiency.
arXiv Detail & Related papers (2021-02-22T05:23:34Z)
- Multi-level Knowledge Distillation [13.71183256776644]
We introduce Multi-level Knowledge Distillation (MLKD) to transfer richer representational knowledge from teacher to student networks.
MLKD employs three novel teacher-student similarities: individual similarity, relational similarity, and categorical similarity.
Experiments demonstrate that MLKD outperforms other state-of-the-art methods on both similar-architecture and cross-architecture tasks.
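One way to instantiate the three similarity levels named above, assuming the student and teacher embeddings have already been projected to a common dimension; the concrete loss forms are illustrative, not the paper's exact definitions.
```python
# Illustrative losses (PyTorch), assuming embeddings already share a common dimension.
import torch
import torch.nn.functional as F


def individual_similarity_loss(s_emb, t_emb):
    # Instance level: each student embedding should agree with its teacher counterpart.
    return (1.0 - F.cosine_similarity(s_emb, t_emb, dim=1)).mean()


def relational_similarity_loss(s_emb, t_emb):
    # Relation level: the batch-wise pairwise-similarity structures should agree.
    s = F.normalize(s_emb, dim=1)
    t = F.normalize(t_emb, dim=1)
    return F.mse_loss(s @ s.t(), t @ t.t())


def categorical_similarity_loss(s_emb, t_emb, labels):
    # Category level: class centroids computed in each embedding space should agree.
    classes = labels.unique(sorted=True)
    s_cent = torch.stack([s_emb[labels == c].mean(dim=0) for c in classes])
    t_cent = torch.stack([t_emb[labels == c].mean(dim=0) for c in classes])
    return (1.0 - F.cosine_similarity(s_cent, t_cent, dim=1)).mean()
```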
arXiv Detail & Related papers (2020-12-01T15:27:15Z)
- Dual Policy Distillation [58.43610940026261]
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning.
In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment.
The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms.
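A hypothetical sketch of distilling only the beneficial knowledge from the peer: each learner imitates the peer's action distribution on states where the peer's value estimate is higher. The interfaces below are assumptions, not the paper's implementation.
```python
# Hypothetical sketch (PyTorch); the interfaces are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def peer_distillation_loss(own_logits, peer_logits, own_values, peer_values):
    """own_logits, peer_logits: (B, A) action logits of the two learners on shared states;
    own_values, peer_values: (B,) state-value estimates from each learner's critic."""
    # Keep only the states on which the peer appears to be more competent.
    beneficial = (peer_values > own_values).float().detach()

    # Imitate the peer's (detached) action distribution on those states only.
    kl = F.kl_div(F.log_softmax(own_logits, dim=1),
                  F.softmax(peer_logits.detach(), dim=1),
                  reduction="none").sum(dim=1)
    return (beneficial * kl).sum() / beneficial.sum().clamp_min(1.0)
```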
arXiv Detail & Related papers (2020-06-07T06:49:47Z)
- Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.