Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation
- URL: http://arxiv.org/abs/2206.15047v1
- Date: Thu, 30 Jun 2022 06:23:03 GMT
- Title: Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation
- Authors: Giung Nam, Hyungi Lee, Byeongho Heo, Juho Lee
- Abstract summary: The heavy computational cost of deep ensembles motivates distilling knowledge from the ensemble teacher into a smaller student network.
We propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers.
We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student.
- Score: 22.87106703794863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensembles of deep neural networks have demonstrated superior performance, but
their heavy computational cost hinders applying them in resource-limited
environments. This motivates distilling knowledge from the ensemble teacher into
a smaller student network, and there are two important design choices for this
ensemble distillation: 1) how to construct the student network, and 2) what
data should be shown during training. In this paper, we propose a weight
averaging technique where a student with multiple subnetworks is trained to
absorb the functional diversity of ensemble teachers, but then those
subnetworks are properly averaged for inference, giving a single student
network with no additional inference cost. We also propose a perturbation
strategy that seeks inputs from which the diversities of teachers can be better
transferred to the student. Combining these two, our method significantly
improves upon previous methods on various image classification tasks.
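Since the abstract gives only the high-level recipe, the following is a minimal, hedged sketch in PyTorch of the two ideas it describes: distilling an ensemble teacher into a student composed of several subnetworks, perturbing inputs toward higher teacher disagreement, and finally weight-averaging the subnetworks into a single network for inference. The function names (`distillation_step`, `average_subnetworks`), the sign-gradient perturbation, and the variance-based disagreement measure are illustrative assumptions, not the authors' implementation; naive weight averaging is only meaningful when the subnetworks stay close in weight space (e.g., a shared backbone with small per-subnetwork factors).

```python
import copy

import torch
import torch.nn.functional as F


def distillation_step(teachers, subnets, x, optimizer, eps=1e-2):
    """One illustrative training step: each student subnetwork matches one
    teacher on inputs nudged toward higher teacher disagreement."""
    # Diversifying perturbation (illustrative): move x in the direction that
    # increases the variance of the teachers' predictive distributions.
    x = x.detach().clone().requires_grad_(True)
    probs = torch.stack([F.softmax(t(x), dim=-1) for t in teachers])  # (K, B, C)
    disagreement = probs.var(dim=0).sum()
    (grad,) = torch.autograd.grad(disagreement, x)
    x_pert = (x + eps * grad.sign()).detach()

    # Distill teacher k into student subnetwork k on the perturbed inputs.
    optimizer.zero_grad()
    loss = x.new_zeros(())
    for teacher, subnet in zip(teachers, subnets):
        with torch.no_grad():
            t_probs = F.softmax(teacher(x_pert), dim=-1)
        s_log_probs = F.log_softmax(subnet(x_pert), dim=-1)
        loss = loss + F.kl_div(s_log_probs, t_probs, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()


def average_subnetworks(subnets):
    """Average the trained subnetworks' weights into one student network so
    that inference costs a single forward pass."""
    student = copy.deepcopy(subnets[0])
    avg_state = {}
    for key, ref in student.state_dict().items():
        stacked = torch.stack([s.state_dict()[key].float() for s in subnets])
        avg_state[key] = stacked.mean(dim=0).to(ref.dtype)
    student.load_state_dict(avg_state)
    return student
```

The sketch only mirrors the structure stated in the abstract; the paper defines the student construction, the averaging, and the perturbation objective more carefully.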
Related papers
- Distribution Shift Matters for Knowledge Distillation with Webly
Collected Images [91.66661969598755]
We propose a novel method dubbed "Knowledge Distillation between Different Distributions" (KD$^3$).
We first dynamically select useful training instances from the webly collected data according to the combined predictions of teacher network and student network.
We also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment.
arXiv Detail & Related papers (2023-07-21T10:08:58Z) - Crowd Counting with Online Knowledge Learning [23.602652841154164]
We propose an online knowledge learning method for crowd counting.
Our method builds an end-to-end training framework that integrates two independent networks into a single architecture.
Our method achieves comparable performance to state-of-the-art methods despite using far fewer parameters.
arXiv Detail & Related papers (2023-03-18T03:27:57Z) - Knowledge Distillation via Weighted Ensemble of Teaching Assistants [18.593268785143426]
Knowledge distillation is the process of transferring knowledge from a large model called the teacher to a smaller model called the student.
When the network size gap between the teacher and student increases, the performance of the student network decreases.
We show that, by using multiple teaching assistant models, the student model (the smaller model) can be further improved.
arXiv Detail & Related papers (2022-06-23T22:50:05Z) - Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student
Settings and its Superiority to Kernel Methods [58.44819696433327]
We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms kernel methods.
arXiv Detail & Related papers (2022-05-30T02:51:36Z) - Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For
Model Compression [2.538209532048867]
Mutual Learning (ML) provides an alternative strategy where multiple simple student networks benefit from sharing knowledge.
We propose a single-teacher, multi-student framework that leverages both KD and ML to achieve better performance.
arXiv Detail & Related papers (2021-10-21T09:59:31Z) - Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z) - Distilling Knowledge via Knowledge Review [69.15050871776552]
We study cross-level connection paths between the teacher and student networks, and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our finally designed nested and compact framework requires negligible overhead, and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Densely Guided Knowledge Distillation using Multiple Teacher Assistants [5.169724825219126]
We propose a densely guided knowledge distillation using multiple teacher assistants that gradually decreases the model size.
We also design a teaching scheme where, for each mini-batch, the teacher or teacher assistants are randomly dropped.
This acts as a regularizer that improves the efficiency of teaching the student network; a hedged sketch of this dropping scheme appears after this list.
arXiv Detail & Related papers (2020-09-18T13:12:52Z) - Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce feature aggregation to imitate multi-teacher distillation within a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z) - Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z)
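As referenced in the densely guided distillation item above, the sketch below illustrates the idea of randomly dropping the teacher or teacher assistants per mini-batch so that the student is regularized by a changing subset of guides. The function name `densely_guided_kd_loss` and the hyperparameters (`temperature`, `alpha`, `drop_prob`) are assumptions made for illustration, not the authors' code.

```python
import random

import torch
import torch.nn.functional as F


def densely_guided_kd_loss(student_logits, guide_logits, labels,
                           temperature=4.0, alpha=0.5, drop_prob=0.5):
    """Illustrative loss: the student matches the averaged soft targets of a
    randomly kept subset of guides (teacher + teacher assistants), plus the
    usual cross-entropy term on the ground-truth labels."""
    # Randomly drop guides for this mini-batch; always keep at least one.
    kept = [g for g in guide_logits if random.random() > drop_prob]
    if not kept:
        kept = [random.choice(guide_logits)]

    # Soft targets averaged over the surviving guides.
    soft_target = torch.stack(
        [F.softmax(g / temperature, dim=-1) for g in kept]
    ).mean(dim=0)

    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        soft_target,
        reduction="batchmean",
    ) * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```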