Decoupled Knowledge with Ensemble Learning for Online Distillation
- URL: http://arxiv.org/abs/2312.11218v1
- Date: Mon, 18 Dec 2023 14:08:59 GMT
- Title: Decoupled Knowledge with Ensemble Learning for Online Distillation
- Authors: Baitan Shao, Ying Chen
- Abstract summary: Online knowledge distillation is a one-stage strategy that alleviates the need for a pre-trained teacher through mutual learning and collaborative learning.
Recent peer collaborative learning (PCL) integrates online ensemble, collaboration of base networks and temporal mean teacher to construct effective knowledge.
Decoupled knowledge for online knowledge distillation is generated by an independent teacher that is separate from the student.
- Score: 3.794605440322862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline distillation is a two-stage pipeline that requires expensive
resources to train a teacher network and then distill the knowledge to a
student for deployment. Online knowledge distillation, on the other hand, is a
one-stage strategy that alleviates this requirement through mutual learning and
collaborative learning. Recent peer collaborative learning (PCL) integrates
online ensemble, collaboration of base networks and temporal mean teacher to
construct effective knowledge. However, model collapse occasionally occurs in PCL
due to high homogenization between the student and the teacher. In this paper,
the cause of the high homogenization is analyzed and the solution is presented.
Decoupled knowledge for online knowledge distillation is generated by an
independent teacher that is separate from the student. Such a design increases
the diversity between the networks and reduces the possibility of model collapse. To
obtain early decoupled knowledge, an initialization scheme for the teacher is
devised, and a 2D geometry-based analysis experiment is conducted under ideal
conditions to showcase the effectiveness of this scheme. Moreover, to improve
the teacher's supervisory resilience, a decaying ensemble scheme is devised: the
teacher's knowledge is assembled with a dynamic weight that is large at the start
of training and gradually decreases as training proceeds. The assembled knowledge
serves as a strong teacher during early training, while the down-weighted
knowledge in later stages mitigates the distribution deviation that a potentially
overfitted teacher's supervision would otherwise introduce (a minimal sketch of
this weighting follows the abstract).
A Monte Carlo-based simulation is conducted to evaluate the convergence.
Extensive experiments on CIFAR-10, CIFAR-100 and TinyImageNet show the
superiority of our method. Ablation studies and further analysis demonstrate
its effectiveness.
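The decaying ensemble idea lends itself to a short illustration. The sketch below is a minimal, hypothetical PyTorch rendering of the weighting described above: an independently initialized teacher supplies decoupled knowledge, and its contribution to the student's loss is scaled by a weight that starts large and decays over training. The cosine schedule, the loss combination, and the names (`decaying_weight`, `distillation_loss`, `tau`) are assumptions made for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of a decaying-weight teacher supervision scheme.
# The schedule and loss weighting are assumptions, not the paper's method.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def decaying_weight(epoch: int, total_epochs: int,
                    w_max: float = 1.0, w_min: float = 0.0) -> float:
    # Assumed cosine schedule: largest at the start of training and
    # decaying toward w_min, matching the "large at the start,
    # gradually decreasing" behaviour described in the abstract.
    progress = epoch / max(total_epochs - 1, 1)
    return w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(math.pi * progress))


def distillation_loss(student_logits, teacher_logits, labels,
                      epoch, total_epochs, tau: float = 3.0):
    # Cross-entropy on ground-truth labels plus a KL term toward the
    # decay-weighted knowledge of the independent (decoupled) teacher.
    ce = F.cross_entropy(student_logits, labels)
    w = decaying_weight(epoch, total_epochs)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    return ce + w * kd


if __name__ == "__main__":
    # Toy example with linear "networks"; the paper's experiments use
    # CIFAR/TinyImageNet backbones instead.
    torch.manual_seed(0)
    student = nn.Linear(32, 10)
    teacher = nn.Linear(32, 10)  # independently initialized, i.e. decoupled from the student
    x = torch.randn(8, 32)
    y = torch.randint(0, 10, (8,))
    with torch.no_grad():
        t_logits = teacher(x)
    loss = distillation_loss(student(x), t_logits, y, epoch=0, total_epochs=200)
    print(f"epoch 0: loss={loss.item():.4f}, teacher weight={decaying_weight(0, 200):.2f}")
```

Under these assumptions, the exact schedule shape matters less than the qualitative behaviour the abstract argues for: strong teacher guidance early in training, fading supervision once the teacher risks overfitting.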
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z) - Toward Student-Oriented Teacher Network Training For Knowledge Distillation [40.55715466657349]
We propose a teacher training method, SoTeacher, which incorporates Lipschitz regularization and consistency regularization into ERM.
Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
arXiv Detail & Related papers (2022-06-14T07:51:25Z) - Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For
Model Compression [2.538209532048867]
Mutual Learning (ML) provides an alternative strategy where multiple simple student networks benefit from sharing knowledge.
We propose a single-teacher, multi-student framework that leverages both KD and ML to achieve better performance.
arXiv Detail & Related papers (2021-10-21T09:59:31Z) - Student Network Learning via Evolutionary Knowledge Distillation [22.030934154498205]
We propose an evolutionary knowledge distillation approach to improve the transfer effectiveness of teacher knowledge.
Instead of a fixed pre-trained teacher, an evolutionary teacher is learned online and consistently transfers intermediate knowledge to supervise student network learning on-the-fly.
In this way, the student can simultaneously obtain rich internal knowledge and capture its growth process, leading to effective student network learning.
arXiv Detail & Related papers (2021-03-23T02:07:15Z) - Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z) - Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z) - Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z) - Peer Collaborative Learning for Online Knowledge Distillation [69.29602103582782]
The Peer Collaborative Learning method integrates online ensembling and network collaboration into a unified framework.
Experiments on CIFAR-10, CIFAR-100 and ImageNet show that the proposed method significantly improves the generalisation of various backbone networks.
arXiv Detail & Related papers (2020-06-07T13:21:52Z) - Dual Policy Distillation [58.43610940026261]
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning.
In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment.
The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms.
arXiv Detail & Related papers (2020-06-07T06:49:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.