Towards Understanding Ensemble, Knowledge Distillation and
Self-Distillation in Deep Learning
- URL: http://arxiv.org/abs/2012.09816v1
- Date: Thu, 17 Dec 2020 18:34:45 GMT
- Title: Towards Understanding Ensemble, Knowledge Distillation and
Self-Distillation in Deep Learning
- Authors: Zeyuan Allen-Zhu and Yuanzhi Li
- Abstract summary: We study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model.
We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory.
We prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
- Score: 93.18238573921629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We formally study how an ensemble of deep learning models can improve
test accuracy, and how the superior performance of the ensemble can be distilled
into a single model using Knowledge Distillation. We consider the challenging case
where the ensemble is simply an average of the outputs of a few independently
trained neural networks with the SAME architecture, trained using the SAME
algorithm on the SAME data set, differing only in the random seeds used for
initialization. We empirically show that ensemble/knowledge distillation
in deep learning works very differently from traditional learning theory,
especially differently from ensemble of random feature mappings or the
neural-tangent-kernel feature mappings, and is potentially out of the scope of
existing theorems. Thus, to properly understand ensemble and knowledge
distillation in deep learning, we develop a theory showing that when data has a
structure we refer to as "multi-view", then ensemble of independently trained
neural networks can provably improve test accuracy, and such superior test
accuracy can also be provably distilled into a single model by training a
single model to match the output of the ensemble instead of the true label. Our
result sheds light on how ensembles work in deep learning in a way that is
completely different from traditional theorems, and on how the "dark knowledge"
hidden in the outputs of the ensemble -- which can be used in knowledge
distillation -- goes beyond the true data labels. Finally, we prove that
self-distillation can also be viewed as implicitly combining ensemble and
knowledge distillation to improve test accuracy.
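To make the setup concrete, the following is a minimal PyTorch-style sketch (not the authors' code) of the pipeline the abstract describes: several copies of the SAME architecture are trained from different random seeds, their averaged output forms the ensemble, and a single student is then distilled by matching the ensemble's softened output rather than only the true labels; self-distillation corresponds to using a single previously trained copy of the same model as the teacher. The helpers `make_model`, `train`, and `loader`, as well as the temperature and mixing weight, are hypothetical placeholders.

```python
# Minimal sketch of ensemble + knowledge distillation as studied in the paper.
# `make_model`, `train`, and `loader` are assumed placeholders, not from the paper.
import copy
import torch
import torch.nn.functional as F

def train_ensemble(make_model, train, loader, seeds=(0, 1, 2)):
    """Train K copies of the same architecture, differing only in the random seed."""
    models = []
    for seed in seeds:
        torch.manual_seed(seed)
        model = make_model()          # identical architecture every time
        train(model, loader)          # identical algorithm and data every time
        models.append(model.eval())
    return models

@torch.no_grad()
def ensemble_logits(models, x):
    """The ensemble is simply the average of the individual models' outputs."""
    return torch.stack([m(x) for m in models]).mean(dim=0)

def distill_step(student, teachers, x, y, optimizer, T=4.0, alpha=0.9):
    """One knowledge-distillation step: the student matches the ensemble's
    softened output (the 'dark knowledge') in addition to the true labels."""
    teacher_probs = F.softmax(ensemble_logits(teachers, x) / T, dim=-1)
    student_logits = student(x)
    kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       teacher_probs, reduction="batchmean") * T * T
    ce_loss = F.cross_entropy(student_logits, y)
    loss = alpha * kd_loss + (1 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Self-distillation is the special case where the "ensemble" is a single,
# previously trained copy of the same architecture:
#   teachers = [copy.deepcopy(trained_model).eval()]
```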
Related papers
- LAKD-Activation Mapping Distillation Based on Local Learning [12.230042188890838]
This paper proposes a novel knowledge distillation framework, Local Attention Knowledge Distillation (LAKD).
LAKD more efficiently utilizes the distilled information from teacher networks, achieving higher interpretability and competitive performance.
We conducted experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets, and the results show that our LAKD method significantly outperforms existing methods.
arXiv Detail & Related papers (2024-08-21T09:43:27Z)
- Towards a theory of model distillation [0.0]
Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original.
We show how to efficiently distill neural networks into succinct, explicit decision tree representations.
We prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.
arXiv Detail & Related papers (2024-03-14T02:42:19Z)
- Learning Discretized Bayesian Networks with GOMEA [0.0]
We extend an existing state-of-the-art structure learning approach to jointly learn variable discretizations.
We show how this enables incorporating expert knowledge in a uniquely insightful fashion, finding multiple DBNs that trade off complexity, accuracy, and the difference from a pre-determined expert network.
arXiv Detail & Related papers (2024-02-19T14:29:35Z)
- Distribution Shift Matters for Knowledge Distillation with Webly Collected Images [91.66661969598755]
We propose a novel method dubbed "Knowledge Distillation between Different Distributions" (KD$^3$).
We first dynamically select useful training instances from the webly collected data according to the combined predictions of the teacher and student networks.
We also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment.
arXiv Detail & Related papers (2023-07-21T10:08:58Z)
- Self-Knowledge Distillation for Surgical Phase Recognition [8.708027525926193]
We propose a self-knowledge distillation framework that can be integrated into current state-of-the-art (SOTA) models.
Our framework is embedded on top of four popular SOTA approaches and consistently improves their performance.
arXiv Detail & Related papers (2023-06-15T08:55:00Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities.
We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs.
Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z)
- Distilling Holistic Knowledge with Graph Neural Networks [37.86539695906857]
Knowledge Distillation (KD) aims at transferring knowledge from a larger well-optimized teacher network to a smaller learnable student network.
Existing KD methods have mainly considered two types of knowledge, namely the individual knowledge and the relational knowledge.
We propose to distill the novel holistic knowledge based on an attributed graph constructed among instances.
arXiv Detail & Related papers (2021-08-12T02:47:59Z)
- Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification [57.5041270212206]
We present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images.
BAKE achieves online knowledge ensembling across multiple samples with only a single network.
It requires minimal computational and memory overhead compared to existing knowledge ensembling methods.
arXiv Detail & Related papers (2021-04-27T16:11:45Z)
- Towards a Universal Continuous Knowledge Base [49.95342223987143]
We propose a method for building a continuous knowledge base that can store knowledge imported from multiple neural networks.
We import the knowledge from multiple models to the knowledge base, from which the fused knowledge is exported back to a single model.
Experiments on text classification show promising results.
arXiv Detail & Related papers (2020-12-25T12:27:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.