Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation
- URL: http://arxiv.org/abs/2206.15047v1
- Date: Thu, 30 Jun 2022 06:23:03 GMT
- Title: Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation
- Authors: Giung Nam, Hyungi Lee, Byeongho Heo, Juho Lee
- Abstract summary: The heavy computational cost of deep ensembles motivates distilling knowledge from the ensemble teacher into a smaller student network.
We propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers.
We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student.
- Score: 22.87106703794863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensembles of deep neural networks have demonstrated superior performance, but
their heavy computational cost hinders applying them in resource-limited
environments. This motivates distilling knowledge from the ensemble teacher into
a smaller student network, and there are two important design choices for this
ensemble distillation: 1) how to construct the student network, and 2) what
data should be shown during training. In this paper, we propose a weight
averaging technique where a student with multiple subnetworks is trained to
absorb the functional diversity of ensemble teachers, but then those
subnetworks are properly averaged for inference, giving a single student
network with no additional inference cost. We also propose a perturbation
strategy that seeks inputs from which the diversities of teachers can be better
transferred to the student. Combining these two, our method significantly
improves upon previous methods on various image classification tasks.
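Since the abstract gives only the high-level recipe, the following is a minimal, hedged sketch in PyTorch of the two ideas it describes: distilling an ensemble teacher into a student composed of several subnetworks, perturbing inputs toward higher teacher disagreement, and finally weight-averaging the subnetworks into a single network for inference. The function names (`distillation_step`, `average_subnetworks`), the sign-gradient perturbation, and the variance-based disagreement measure are illustrative assumptions, not the authors' implementation; naive weight averaging is only meaningful when the subnetworks stay close in weight space (e.g., a shared backbone with small per-subnetwork factors).

```python
import copy

import torch
import torch.nn.functional as F


def distillation_step(teachers, subnets, x, optimizer, eps=1e-2):
    """One illustrative training step: each student subnetwork matches one
    teacher on inputs nudged toward higher teacher disagreement."""
    # Diversifying perturbation (illustrative): move x in the direction that
    # increases the variance of the teachers' predictive distributions.
    x = x.detach().clone().requires_grad_(True)
    probs = torch.stack([F.softmax(t(x), dim=-1) for t in teachers])  # (K, B, C)
    disagreement = probs.var(dim=0).sum()
    (grad,) = torch.autograd.grad(disagreement, x)
    x_pert = (x + eps * grad.sign()).detach()

    # Distill teacher k into student subnetwork k on the perturbed inputs.
    optimizer.zero_grad()
    loss = x.new_zeros(())
    for teacher, subnet in zip(teachers, subnets):
        with torch.no_grad():
            t_probs = F.softmax(teacher(x_pert), dim=-1)
        s_log_probs = F.log_softmax(subnet(x_pert), dim=-1)
        loss = loss + F.kl_div(s_log_probs, t_probs, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()


def average_subnetworks(subnets):
    """Average the trained subnetworks' weights into one student network so
    that inference costs a single forward pass."""
    student = copy.deepcopy(subnets[0])
    avg_state = {}
    for key, ref in student.state_dict().items():
        stacked = torch.stack([s.state_dict()[key].float() for s in subnets])
        avg_state[key] = stacked.mean(dim=0).to(ref.dtype)
    student.load_state_dict(avg_state)
    return student
```

The sketch only mirrors the structure stated in the abstract; the paper defines the student construction, the averaging, and the perturbation objective more carefully.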
Related papers
- Distribution Shift Matters for Knowledge Distillation with Webly
Collected Images [91.66661969598755]
We propose a novel method dubbed "Knowledge Distillation between Different Distributions" (KD$^3$).
We first dynamically select useful training instances from the webly collected data according to the combined predictions of teacher network and student network.
We also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment.
arXiv Detail & Related papers (2023-07-21T10:08:58Z) - Crowd Counting with Online Knowledge Learning [23.602652841154164]
We propose an online knowledge learning method for crowd counting.
Our method builds an end-to-end training framework that integrates two independent networks into a single architecture.
Our method achieves comparable performance to state-of-the-art methods despite using far fewer parameters.
arXiv Detail & Related papers (2023-03-18T03:27:57Z) - Knowledge Distillation via Weighted Ensemble of Teaching Assistants [18.593268785143426]
Knowledge distillation is the process of transferring knowledge from a large model called the teacher to a smaller model called the student.
When the network size gap between the teacher and student increases, the performance of the student network decreases.
We show that, by using multiple teaching assistant models, the student model (the smaller model) can be further improved.
arXiv Detail & Related papers (2022-06-23T22:50:05Z) - Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student
Settings and its Superiority to Kernel Methods [58.44819696433327]
We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms kernel methods.
arXiv Detail & Related papers (2022-05-30T02:51:36Z) - Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For
Model Compression [2.538209532048867]
Mutual Learning (ML) provides an alternative strategy where multiple simple student networks benefit from sharing knowledge.
We propose a single-teacher, multi-student framework that leverages both KD and ML to achieve better performance.
arXiv Detail & Related papers (2021-10-21T09:59:31Z) - Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z) - Distilling Knowledge via Knowledge Review [69.15050871776552]
We study cross-level connection paths between the teacher and student networks, and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our finally designed nested and compact framework requires negligible overhead, and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Densely Guided Knowledge Distillation using Multiple Teacher Assistants [5.169724825219126]
We propose a densely guided knowledge distillation using multiple teacher assistants that gradually decreases the model size.
We also design a teaching scheme where, for each mini-batch, the teacher or teacher assistants are randomly dropped.
This acts as a regularizer that improves the efficiency of teaching the student network; a hedged sketch of this dropping scheme appears after this list.
arXiv Detail & Related papers (2020-09-18T13:12:52Z) - Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce feature aggregation to imitate multi-teacher distillation within a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z) - Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z)
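As referenced in the densely guided distillation item above, the sketch below illustrates the idea of randomly dropping the teacher or teacher assistants per mini-batch so that the student is regularized by a changing subset of guides. The function name `densely_guided_kd_loss` and the hyperparameters (`temperature`, `alpha`, `drop_prob`) are assumptions made for illustration, not the authors' code.

```python
import random

import torch
import torch.nn.functional as F


def densely_guided_kd_loss(student_logits, guide_logits, labels,
                           temperature=4.0, alpha=0.5, drop_prob=0.5):
    """Illustrative loss: the student matches the averaged soft targets of a
    randomly kept subset of guides (teacher + teacher assistants), plus the
    usual cross-entropy term on the ground-truth labels."""
    # Randomly drop guides for this mini-batch; always keep at least one.
    kept = [g for g in guide_logits if random.random() > drop_prob]
    if not kept:
        kept = [random.choice(guide_logits)]

    # Soft targets averaged over the surviving guides.
    soft_target = torch.stack(
        [F.softmax(g / temperature, dim=-1) for g in kept]
    ).mean(dim=0)

    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        soft_target,
        reduction="batchmean",
    ) * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```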