Distilling a Powerful Student Model via Online Knowledge Distillation
- URL: http://arxiv.org/abs/2103.14473v2
- Date: Mon, 29 Mar 2021 07:04:28 GMT
- Title: Distilling a Powerful Student Model via Online Knowledge Distillation
- Authors: Shaojie Li, Mingbao Lin, Yan Wang, Feiyue Huang, Yongjian Wu, Yonghong
Tian, Ling Shao, Rongrong Ji
- Abstract summary: Existing online knowledge distillation approaches either adopt the student with the best performance or construct an ensemble model for better holistic performance.
We propose a novel method for online knowledge distillation, termed FFSD, which comprises two key components: Feature Fusion and Self-Distillation.
- Score: 158.68873654990895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing online knowledge distillation approaches either adopt the student
with the best performance or construct an ensemble model for better holistic
performance. However, the former strategy ignores other students' information,
while the latter increases the computational complexity. In this paper, we
propose a novel method for online knowledge distillation, termed FFSD, which
comprises two key components: Feature Fusion and Self-Distillation, towards
solving the above problems in a unified framework. Different from previous
works, where all students are treated equally, the proposed FFSD splits them
into a student leader and a common student set. Then, the feature fusion module
converts the concatenation of feature maps from all common students into a
fused feature map. The fused representation is used to assist the learning of
the student leader. To enable the student leader to absorb more diverse
information, we design an enhancement strategy to increase the diversity among
students. Besides, a self-distillation module is adopted to convert the feature
map of deeper layers into a shallower one. Then, the shallower layers are
encouraged to mimic the transformed feature maps of the deeper layers, which
helps the students to generalize better. After training, we simply adopt the
student leader, which achieves superior performance over the common students,
without increasing the storage or inference cost. Extensive experiments on
CIFAR-100 and ImageNet demonstrate the superiority of our FFSD over existing
works. The code is available at https://github.com/SJLeo/FFSD.
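Below is a minimal PyTorch-style sketch of the two components described in the abstract: fusing the concatenated feature maps of the common students into a single map that guides the student leader, and transforming a deeper feature map so a shallower layer can mimic it. The 1x1-convolution fusion, the MSE mimicry losses, and all module names are illustrative assumptions, not the authors' implementation; the official code is at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFusion(nn.Module):
    """Fuse the concatenated feature maps of the common students into one
    fused map (assumed: a 1x1 convolution over the channel concatenation)."""

    def __init__(self, num_students: int, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(num_students * channels, channels, kernel_size=1)

    def forward(self, student_feats):
        # student_feats: list of (B, C, H, W) tensors, one per common student.
        return self.fuse(torch.cat(student_feats, dim=1))


def fusion_assist_loss(leader_feat, fused_feat):
    """Encourage the student leader's feature map to match the fused map
    (assumed MSE; the target is detached so it only guides the leader)."""
    return F.mse_loss(leader_feat, fused_feat.detach())


class SelfDistillation(nn.Module):
    """Convert a deeper feature map into the shape of a shallower one so the
    shallower layer can mimic it (assumed: 1x1 conv + bilinear resizing)."""

    def __init__(self, deep_channels: int, shallow_channels: int):
        super().__init__()
        self.transform = nn.Conv2d(deep_channels, shallow_channels, kernel_size=1)

    def forward(self, deep_feat, shallow_feat):
        target = self.transform(deep_feat)
        target = F.interpolate(target, size=shallow_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        # The shallower layer mimics the transformed deeper feature map.
        return F.mse_loss(shallow_feat, target.detach())
```

Detaching the fused and transformed deeper targets keeps the guidance one-directional, which matches the abstract's description of the fused representation "assisting" the leader and the shallower layers "mimicking" the deeper ones; the exact loss weighting and layer choices would follow the official implementation.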
Related papers
- Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation [29.952771954087602]
Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos.
This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance.
arXiv Detail & Related papers (2023-08-07T17:07:48Z) - Improving Ensemble Distillation With Weight Averaging and Diversifying
Perturbation [22.87106703794863]
The cost of running a full ensemble at inference motivates distilling knowledge from the ensemble teacher into a smaller student network.
We propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers.
We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student.
arXiv Detail & Related papers (2022-06-30T06:23:03Z) - Alignahead: Online Cross-Layer Knowledge Extraction on Graph Neural
Networks [6.8080936803807734]
Existing knowledge distillation methods for graph neural networks (GNNs) operate almost exclusively offline.
We propose a novel online knowledge distillation framework to resolve this problem.
We develop a cross-layer distillation strategy that aligns one student layer ahead with a layer at a different depth of another student model.
arXiv Detail & Related papers (2022-05-05T06:48:13Z) - Extracting knowledge from features with multilevel abstraction [3.4443503349903124]
Self-knowledge distillation (SKD) aims at transferring the knowledge from a large teacher model to a small student model.
In this paper, we propose a novel SKD method that differs from mainstream approaches.
Experiments and ablation studies show its great effectiveness and generalization on various kinds of tasks.
arXiv Detail & Related papers (2021-12-04T02:25:46Z) - Distilling Knowledge via Knowledge Review [69.15050871776552]
We study connection paths across levels between teacher and student networks, and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our final nested and compact framework requires negligible overhead and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Progressive Network Grafting for Few-Shot Knowledge Distillation [60.38608462158474]
We introduce a principled dual-stage distillation scheme tailored for few-shot data.
In the first step, we graft the student blocks one by one onto the teacher, and learn the parameters of the grafted block intertwined with those of the other teacher blocks.
Experiments demonstrate that our approach, with only a few unlabeled samples, achieves gratifying results on CIFAR10, CIFAR100, and ILSVRC-2012.
arXiv Detail & Related papers (2020-12-09T08:34:36Z) - Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce feature aggregation to imitate multi-teacher distillation within a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z) - Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least a 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)