Unified and Effective Ensemble Knowledge Distillation
- URL: http://arxiv.org/abs/2204.00548v1
- Date: Fri, 1 Apr 2022 16:15:39 GMT
- Title: Unified and Effective Ensemble Knowledge Distillation
- Authors: Chuhan Wu, Fangzhao Wu, Tao Qi and Yongfeng Huang
- Abstract summary: Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods learn and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data.
- Score: 92.67156911466397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensemble knowledge distillation can extract knowledge from multiple teacher
models and encode it into a single student model. Many existing methods learn
and distill the student model on labeled data only. However, the teacher models
are usually learned on the same labeled data, and their predictions have high
correlations with groundtruth labels. Thus, they cannot provide sufficient
knowledge complementary to task labels for student teaching. Distilling on
unseen unlabeled data has the potential to enhance the knowledge transfer from
the teachers to the student. In this paper, we propose a unified and effective
ensemble knowledge distillation method that distills a single student model
from an ensemble of teacher models on both labeled and unlabeled data. Since
different teachers may have diverse prediction correctness on the same sample,
on labeled data we weight the predictions of different teachers according to
their correctness. In addition, we weight the distillation loss based on the
overall prediction correctness of the teacher ensemble to distill high-quality
knowledge. On unlabeled data, there is no groundtruth to evaluate prediction
correctness. Fortunately, the disagreement among teachers is an indication of
sample hardness, and thereby we weight the distillation loss based on teachers'
disagreement to emphasize knowledge distillation on important samples.
Extensive experiments on four datasets show the effectiveness of our proposed
ensemble distillation method.
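The weighting scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the correctness proxy (probability assigned to the true label) and the disagreement proxy (mean KL divergence of each teacher from the ensemble average) are assumptions chosen for concreteness.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def labeled_distill_targets(teacher_logits, labels):
    """Weight each teacher's prediction by its correctness on labeled data.

    teacher_logits: (T, N, C) logits from T teachers, N samples, C classes.
    labels: (N,) groundtruth class indices.
    Returns weighted soft targets (N, C) and a per-sample distillation-loss
    weight (N,) based on the teacher ensemble's overall correctness.
    """
    probs = softmax(teacher_logits)                      # (T, N, C)
    # Correctness proxy: probability each teacher assigns to the true label.
    correct = probs[:, np.arange(len(labels)), labels]   # (T, N)
    w = correct / correct.sum(axis=0, keepdims=True)     # normalize over teachers
    targets = (w[:, :, None] * probs).sum(axis=0)        # (N, C)
    # Loss weight: average teacher correctness on this sample.
    loss_weight = correct.mean(axis=0)                   # (N,)
    return targets, loss_weight

def unlabeled_loss_weight(teacher_logits):
    """On unlabeled data, weight the distillation loss by teacher disagreement.

    Disagreement proxy: mean KL divergence of each teacher's distribution
    from the ensemble average; harder samples get larger weights.
    """
    probs = softmax(teacher_logits)              # (T, N, C)
    mean_p = probs.mean(axis=0)                  # (N, C)
    kl = (probs * (np.log(probs + 1e-12)
                   - np.log(mean_p + 1e-12))).sum(axis=-1)  # (T, N)
    return kl.mean(axis=0)                       # (N,)
```

The student would then be trained on the weighted soft targets, with each sample's distillation loss scaled by the corresponding weight.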
Related papers
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Controlling the Quality of Distillation in Response-Based Network Compression [0.0]
The performance of a compressed network is governed by the quality of distillation.
For a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher.
arXiv Detail & Related papers (2021-12-19T02:53:51Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
- Knowledge Distillation as Semiparametric Inference [44.572422527672416]
A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model.
This two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data.
We cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate.
arXiv Detail & Related papers (2021-04-20T03:00:45Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model with the guide of a pre-trained larger teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.