Online Ensemble Model Compression using Knowledge Distillation
- URL: http://arxiv.org/abs/2011.07449v1
- Date: Sun, 15 Nov 2020 04:46:29 GMT
- Title: Online Ensemble Model Compression using Knowledge Distillation
- Authors: Devesh Walawalkar, Zhiqiang Shen, Marios Savvides
- Abstract summary: This paper presents a knowledge distillation based model compression framework consisting of a student ensemble.
It enables distillation of simultaneously learnt ensemble knowledge onto each of the compressed student models.
We provide comprehensive experiments using state-of-the-art classification models to validate our framework's effectiveness.
- Score: 51.59021417947258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a novel knowledge distillation based model compression
framework consisting of a student ensemble. It enables distillation of
simultaneously learnt ensemble knowledge onto each of the compressed student
models. Each model learns unique representations from the data distribution due
to its distinct architecture. This helps the ensemble generalize better by
combining every model's knowledge. The distilled students and ensemble teacher
are trained simultaneously without requiring any pretrained weights. Moreover,
our proposed method can deliver students at multiple compression levels from a
single training run, which is efficient and flexible for different scenarios. We provide
comprehensive experiments using state-of-the-art classification models to
validate our framework's effectiveness. Notably, using our framework, a 97%
compressed ResNet110 student model achieved a 10.64% relative accuracy gain
over its individually trained baseline on the CIFAR100 dataset. Similarly, a 95%
compressed DenseNet-BC(k=12) model achieved an 8.17% relative accuracy gain.
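The training scheme described in the abstract can be pictured with a short sketch. The PyTorch-style snippet below is only an illustration of online ensemble distillation as summarized above, not the authors' released implementation: the ensemble teacher is assumed to be the mean of the student logits, and the temperature `T`, weighting `alpha`, and the helper name `online_ensemble_kd_step` are placeholders.

```python
import torch
import torch.nn.functional as F

def online_ensemble_kd_step(students, optimizer, x, y, T=3.0, alpha=0.5):
    """One step of online ensemble distillation (illustrative sketch).

    students: list of compressed nn.Module models with distinct architectures
    optimizer: built over the parameters of all students
    T: softmax temperature, alpha: distillation weight (assumed values)
    """
    logits = [s(x) for s in students]                  # each student's predictions
    ensemble = torch.stack(logits, dim=0).mean(dim=0)  # ensemble "teacher" = mean logits

    loss = 0.0
    for z in logits:
        ce = F.cross_entropy(z, y)                     # supervised term
        kd = F.kl_div(                                 # distill ensemble knowledge
            F.log_softmax(z / T, dim=1),
            F.softmax(ensemble.detach() / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        loss = loss + (1 - alpha) * ce + alpha * kd

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the ensemble target is rebuilt from the students at every step, no pretrained teacher is required, and each student remains usable on its own after training, which matches the multi-compression claim above.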
Related papers
- Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures [4.960025399247103]
The Generic Teacher Network (GTN) is a one-off, KD-aware training procedure that creates a generic teacher capable of effectively transferring knowledge to any student model sampled from a finite pool of architectures.
Our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across students in the pool.
arXiv Detail & Related papers (2024-07-22T20:34:00Z)
- Enhancing One-Shot Federated Learning Through Data and Ensemble Co-Boosting [76.64235084279292]
One-shot Federated Learning (OFL) has become a promising learning paradigm, enabling the training of a global server model via a single communication round.
We introduce a novel framework, Co-Boosting, in which synthesized data and the ensemble model mutually enhance each other progressively.
arXiv Detail & Related papers (2024-02-23T03:15:10Z)
- PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory Access Prediction Models [2.404163279345609]
PaCKD is a Pattern-Clustered Knowledge Distillation approach for compressing memory access prediction (MAP) models.
PaCKD yields an 8.70% higher result compared to student models trained with standard knowledge distillation and an 8.88% higher result compared to student models trained without any form of knowledge distillation.
arXiv Detail & Related papers (2024-02-21T00:24:34Z)
- One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z)
- CDFKD-MFS: Collaborative Data-free Knowledge Distillation via Multi-level Feature Sharing [24.794665141853905]
We propose a framework termed collaborative data-free knowledge distillation via multi-level feature sharing.
The accuracy of the proposed framework is 1.18% higher on the CIFAR-100 dataset, 1.67% higher on the Caltech dataset, and 2.99% higher on the mini-ImageNet dataset.
arXiv Detail & Related papers (2022-05-24T07:11:03Z)
- Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from powerful off-the-shelf pre-trained models.
Our solution performs distillation by only driving the prediction of the student model to be consistent with that of the teacher model.
We empirically find that such simple distillation settings are extremely effective; for example, the top-1 accuracy of MobileNetV3-large and ResNet50-D on the ImageNet-1k validation set can be significantly improved (see the sketch below).
arXiv Detail & Related papers (2021-03-10T09:32:44Z)
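As a concrete reading of the prediction-only distillation described in the entry above, the following sketch trains the student against nothing but the teacher's soft predictions. The temperature `T` and the soft cross-entropy form are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def prediction_distillation_loss(student_logits, teacher_logits, T=1.0):
    """Soft cross-entropy between teacher and student predictions only
    (no ground-truth labels, no feature-matching terms)."""
    teacher_probs = F.softmax(teacher_logits.detach() / T, dim=1)
    student_logp = F.log_softmax(student_logits / T, dim=1)
    return -(teacher_probs * student_logp).sum(dim=1).mean() * (T * T)
```

In a training loop the off-the-shelf teacher would be frozen (`teacher.eval()` inside `torch.no_grad()`), and only the student parameters are updated with this single loss term.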
- Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning [93.18238573921629]
We study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model.
We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory.
We prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy (see the sketch below).
arXiv Detail & Related papers (2020-12-17T18:34:45Z)
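Self-distillation in the entry above refers to distilling a trained model into a fresh copy of the same architecture. The snippet below is a generic sketch under that reading; the temperature, weighting, and two-generation training schedule are assumptions rather than the paper's setup.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, y, T=4.0, alpha=0.5):
    """Standard KD objective where teacher and student share the same architecture."""
    ce = F.cross_entropy(student_logits, y)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * ce + alpha * kd

# Sketch of the two generations:
#   teacher = train(model_fn())   # first generation, trained normally
#   student = model_fn()          # same architecture, fresh initialization
#   train `student` with self_distillation_loss against the frozen teacher
```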
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model for the whole distillation process, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the varying complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models (see the sketch below).
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
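The per-example teacher weighting motivating the entry above can be sketched as a loss in which every teacher's KL term is scaled by an example-specific weight. In the paper those weights come from a reinforcement-learned selection policy; here the `weights` tensor is a placeholder standing in for that policy, so the snippet only shows the loss structure.

```python
import torch
import torch.nn.functional as F

def weighted_multi_teacher_kd(student_logits, teacher_logits_list, weights, T=2.0):
    """KD loss where each teacher gets a per-example weight.

    teacher_logits_list: list of [batch, classes] tensors, one per teacher
    weights: [batch, num_teachers] tensor, e.g. produced by a learned policy
             (placeholder for the paper's reinforcement-learned selector)
    """
    student_logp = F.log_softmax(student_logits / T, dim=1)
    loss = 0.0
    for k, t_logits in enumerate(teacher_logits_list):
        t_probs = F.softmax(t_logits.detach() / T, dim=1)
        # per-example KL divergence between this teacher and the student
        per_example_kl = (t_probs * (t_probs.log() - student_logp)).sum(dim=1)
        loss = loss + (weights[:, k] * per_example_kl).mean()
    return loss * (T * T)
```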
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability (see the sketch below).
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
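MixKD's name points to mixup-style interpolation of training examples during distillation; the sketch below illustrates that idea in a generic form and should be read as an assumption rather than the paper's exact formulation (in particular, applying mixup to continuous input representations and the temperature `T` are placeholder choices).

```python
import torch
import torch.nn.functional as F

def mixup_kd_loss(student, teacher, x, T=1.0, alpha=0.4):
    """Distillation on mixup-interpolated inputs (illustrative sketch).

    x: continuous input representations (e.g. embeddings), shape [batch, ...]
    alpha: Beta-distribution parameter for the mixing coefficient (assumed value)
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]     # interpolate a pair of examples

    with torch.no_grad():                       # teacher is frozen
        t_probs = F.softmax(teacher(x_mix) / T, dim=1)
    s_logp = F.log_softmax(student(x_mix) / T, dim=1)

    # student matches the teacher's prediction on the interpolated example
    return F.kl_div(s_logp, t_probs, reduction="batchmean") * (T * T)
```

A supervised term on the original, un-mixed examples could be added alongside this loss; the sketch keeps only the distillation part.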
- Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution [17.996541285382463]
We propose extracurricular learning to bridge the gap between a compressed student model and its teacher.
We conduct rigorous evaluations on regression and classification tasks and show that, compared to standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.
This leads to major accuracy improvements compared to the empirical risk minimization-based training for various recent neural network architectures.
arXiv Detail & Related papers (2020-06-30T18:21:21Z)