Online Knowledge Distillation via Multi-branch Diversity Enhancement
- URL: http://arxiv.org/abs/2010.00795v3
- Date: Fri, 13 Nov 2020 14:18:39 GMT
- Title: Online Knowledge Distillation via Multi-branch Diversity Enhancement
- Authors: Zheng Li, Ying Huang, Defang Chen, Tianren Luo, Ning Cai, Zhigeng Pan
- Abstract summary: We propose a new distillation method to enhance the diversity among multiple student models.
We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network.
We also use the Classifier Diversification (CD) loss function to strengthen the differences between the student models.
- Score: 15.523646047674717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is an effective method to transfer the knowledge from
the cumbersome teacher model to the lightweight student model. Online knowledge
distillation uses the ensembled prediction results of multiple student models
as soft targets to train each student model. However, the homogenization
problem will lead to difficulty in further improving model performance. In this
work, we propose a new distillation method to enhance the diversity among
multiple student models. We introduce a Feature Fusion Module (FFM), which
improves the performance of the attention mechanism in the network by
integrating rich semantic information contained in the last block of multiple
student models. Furthermore, we use the Classifier Diversification (CD) loss
function to strengthen the differences between the student models and deliver a
better ensemble result. Extensive experiments proved that our method
significantly enhances the diversity among student models and brings better
distillation performance. We evaluate our method on three image classification
datasets: CIFAR-10/100 and CINIC-10. The results show that our method achieves
state-of-the-art performance on these datasets.
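
The abstract's description of online distillation (the ensembled predictions of several students serve as soft targets for each student) can be illustrated with a minimal sketch. It is not the authors' implementation: the FFM and the CD loss are not specified in the abstract and are left out, and the mean-of-logits ensembling rule, the temperature value, and all function names are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def ensemble_soft_target(branch_logits, temperature=3.0):
    """Average the students' logits, then soften them.

    Averaging logits is an assumed ensembling rule, not taken from the paper.
    """
    n = len(branch_logits)
    mean_logits = [sum(col) / n for col in zip(*branch_logits)]
    return softmax(mean_logits, temperature)


def online_kd_losses(branch_logits, temperature=3.0):
    """Distillation loss of each student against the shared ensemble target."""
    target = ensemble_soft_target(branch_logits, temperature)
    return [
        temperature ** 2 * kl_divergence(target, softmax(logits, temperature))
        for logits in branch_logits
    ]


if __name__ == "__main__":
    # Two hypothetical student branches scoring one sample over three classes.
    logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
    print(online_kd_losses(logits))
```

When all students produce identical logits, every loss in this sketch is zero, which shows the homogenization problem the abstract mentions: the ensemble target then carries no information beyond what each student already predicts.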
Related papers
- Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution [81.81748032199813]
We propose a Distillation-Free One-Step Diffusion model.
Specifically, we propose a noise-aware discriminator (NAD) to participate in adversarial training.
We improve the perceptual loss with edge-aware DISTS (EA-DISTS) to enhance the model's ability to generate fine details.
arXiv Detail & Related papers (2024-10-05T16:41:36Z)
- Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment [10.104085497265004]
We propose Ranking Loss based Knowledge Distillation (RLKD), which encourages consistency of peak predictions between the teacher and student models.
Our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.
arXiv Detail & Related papers (2024-09-19T08:06:42Z)
- AMD: Automatic Multi-step Distillation of Large-scale Vision Models [39.70559487432038]
We present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression.
An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance.
arXiv Detail & Related papers (2024-07-05T01:35:42Z)
- One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Extracting knowledge from features with multilevel abstraction [3.4443503349903124]
Self-knowledge distillation (SKD) aims at transferring the knowledge from a large teacher model to a small student model.
In this paper, we propose a novel SKD method that differs from the mainstream methods.
Experiments and ablation studies show its great effectiveness and generalization on various kinds of tasks.
arXiv Detail & Related papers (2021-12-04T02:25:46Z)
- Similarity Transfer for Knowledge Distillation [25.042405967561212]
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
The results show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
arXiv Detail & Related papers (2021-03-18T06:54:59Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model in the whole distillation.
Most of the existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.