Confidence-Aware Multi-Teacher Knowledge Distillation
- URL: http://arxiv.org/abs/2201.00007v1
- Date: Thu, 30 Dec 2021 11:00:49 GMT
- Title: Confidence-Aware Multi-Teacher Knowledge Distillation
- Authors: Hailin Zhang, Defang Chen, Can Wang
- Abstract summary: Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD) is proposed.
It adaptively assigns sample-wise reliability for each teacher prediction with the help of ground-truth labels.
Our CA-MKD consistently outperforms all compared state-of-the-art methods across various teacher-student architectures.
- Score: 12.938478021855245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation was initially introduced to provide additional
supervision from a single teacher model during student model training. To
boost student performance, some recent variants attempt to exploit diverse
knowledge sources from multiple teachers. However, existing studies mainly
integrate knowledge from diverse sources by averaging multiple teacher
predictions or combining them with various other label-free strategies, which
may mislead the student in the presence of low-quality teacher predictions. To
tackle this problem, we propose Confidence-Aware Multi-teacher Knowledge
Distillation (CA-MKD), which adaptively assigns a sample-wise reliability to
each teacher prediction with the help of ground-truth labels, so that
teacher predictions close to the one-hot labels receive large weights. Besides,
CA-MKD incorporates intermediate layers to further improve student performance.
Extensive experiments show that our CA-MKD consistently outperforms all
compared state-of-the-art methods across various teacher-student architectures.
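To make the weighting idea concrete, the following is a minimal PyTorch-style sketch of confidence-aware multi-teacher weighting at the output layer, assuming per-sample cross-entropy against the ground-truth labels as the reliability signal and a softmax over teachers to turn it into sample-wise weights; the function and variable names are illustrative, and the intermediate-layer distillation term of CA-MKD is omitted.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_kd_loss(student_logits, teacher_logits_list, labels, temperature=4.0):
    """Illustrative sketch (not the authors' code): weight each teacher per sample
    by how close its prediction is to the one-hot ground truth, then distill the
    student toward a reliability-weighted mixture of the teachers."""
    # Per-sample cross-entropy of every teacher w.r.t. the labels, shape [K, B].
    teacher_ce = torch.stack([
        F.cross_entropy(t_logits, labels, reduction="none")
        for t_logits in teacher_logits_list
    ])
    # Lower cross-entropy (prediction closer to the one-hot label) -> larger weight.
    weights = F.softmax(-teacher_ce, dim=0)          # shape [K, B], sums to 1 over teachers

    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        # Per-sample KL(teacher || student), weighted by that teacher's reliability.
        kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
        loss = loss + (w * kl).mean()
    return (temperature ** 2) * loss
```

In this sketch, a teacher whose softened prediction already agrees with the label contributes more to the distillation loss on that sample, which is the sample-wise reliability idea described in the abstract; plain prediction averaging corresponds to replacing the softmax weights with a uniform 1/K.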
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens and the teacher replaces poorly ranked ones based on its own distribution (a sketch of this interleaving appears after the related-papers list).
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Adaptive Multi-Teacher Knowledge Distillation with Meta-Learning [16.293262022872412]
We propose Adaptive Multi-teacher Knowledge Distillation with Meta-Learning (MMKD) to supervise the student with appropriate knowledge from a tailored ensemble teacher.
With the help of a meta-weight network, the diverse yet compatible teacher knowledge in the output layer and intermediate layers is jointly leveraged to enhance the student performance.
arXiv Detail & Related papers (2023-06-11T09:38:45Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the student's performance positively correlated with that of the teacher, which means we can further boost student accuracy by employing larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Faculty Distillation with Optimal Transport [53.69235109551099]
We propose to link the teacher's task and the student's task via optimal transport.
Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions.
Experiments under various settings demonstrate the succinctness and versatility of our method.
arXiv Detail & Related papers (2022-04-25T09:34:37Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a certain student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
- One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z)
- Adaptive Multi-Teacher Multi-level Knowledge Distillation [11.722728148523366]
We propose a novel adaptive multi-teacher multi-level knowledge distillation learning framework (AMTML-KD).
It consists of two novel insights: (i) associating each teacher with a latent representation to adaptively learn instance-level teacher importance weights.
As such, a student model can learn multi-level knowledge from multiple teachers through AMTML-KD.
arXiv Detail & Related papers (2021-03-06T08:18:16Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model with the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
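As referenced in the Speculative Knowledge Distillation entry above, the following is a small hypothetical sketch of its interleaved sampling, assuming Hugging Face-style causal language models whose forward pass returns .logits; the top-k acceptance rule and all names here are assumptions for illustration, not the paper's actual implementation.

```python
import torch

@torch.no_grad()
def skd_interleaved_sample(student, teacher, prompt_ids, max_new_tokens=32, top_k=25):
    """Hypothetical sketch: the student proposes each next token; if the proposal
    falls outside the teacher's top-k, a token sampled from the teacher replaces it.
    The resulting sequences would then serve as on-the-fly distillation data."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        s_logits = student(ids).logits[:, -1, :]     # student next-token logits [B, V]
        t_logits = teacher(ids).logits[:, -1, :]     # teacher next-token logits [B, V]

        proposal = torch.multinomial(s_logits.softmax(dim=-1), 1)    # student proposal [B, 1]
        topk_ids = t_logits.topk(top_k, dim=-1).indices              # teacher's top-k tokens [B, k]
        accepted = (topk_ids == proposal).any(dim=-1, keepdim=True)  # proposal plausible to teacher?

        fallback = torch.multinomial(t_logits.softmax(dim=-1), 1)    # teacher replacement [B, 1]
        next_tok = torch.where(accepted, proposal, fallback)
        ids = torch.cat([ids, next_tok], dim=-1)
    return ids
```

Keeping student proposals that the teacher finds plausible keeps the training data close to what the student would actually generate, while poorly ranked tokens are corrected by the teacher, which is the teacher-student gap-bridging behavior the entry describes.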