Adaptive Distillation: Aggregating Knowledge from Multiple Paths for
Efficient Distillation
- URL: http://arxiv.org/abs/2110.09674v1
- Date: Tue, 19 Oct 2021 00:57:40 GMT
- Title: Adaptive Distillation: Aggregating Knowledge from Multiple Paths for
Efficient Distillation
- Authors: Sumanth Chennupati, Mohammad Mahdi Kamani, Zhongwei Cheng, Lin Chen
- Abstract summary: Knowledge Distillation is becoming one of the primary trends among neural network compression algorithms.
This paper introduces an adaptive approach, based on multitask learning methods, for aggregating multiple distillation paths.
We empirically demonstrate the effectiveness of the proposed approach over other baselines on the applications of knowledge distillation in classification, semantic segmentation, and object detection tasks.
- Score: 15.337420940135704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation is becoming one of the primary trends among neural
network compression algorithms for improving the generalization performance of a
smaller student model with guidance from a larger teacher model. This rise in
applications of knowledge distillation has been accompanied by numerous algorithms
for distilling knowledge, such as soft targets and hint layers. Despite these
advances in distillation techniques, the aggregation of different distillation paths
has not been studied comprehensively. This is of particular significance, not only
because different paths have different importance, but also because some paths might
harm the generalization performance of the student model. Hence, the importance of
each path needs to be adjusted adaptively to maximize the impact of distillation on
the student model. In this paper, we explore different approaches for aggregating
these paths and introduce our adaptive approach based on multitask learning methods.
We empirically demonstrate the effectiveness of the proposed approach over other
baselines on applications of knowledge distillation in classification, semantic
segmentation, and object detection tasks.
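To make the aggregation idea concrete, here is a minimal PyTorch-style sketch, assuming a classification setting with three distillation paths: the task loss, a soft-target path, and a hint-layer path. The softmax-over-learnable-logits weighting shown below only illustrates the interface; the paper derives its adaptive weights from multitask learning methods, which this sketch does not reproduce. All function and class names are ours.

```python
# Illustrative sketch (not the paper's exact algorithm): aggregate several
# distillation "paths" -- task loss, soft targets, hint features -- with
# adaptive per-path weights. A naive learned softmax weighting is shown only
# to make the interface concrete.
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def hint_loss(student_feat, teacher_feat):
    """L2 matching between intermediate (hint-layer) feature maps."""
    return F.mse_loss(student_feat, teacher_feat)

class PathAggregator(torch.nn.Module):
    """Holds one learnable logit per distillation path; weights are softmax-normalized."""
    def __init__(self, num_paths):
        super().__init__()
        self.path_logits = torch.nn.Parameter(torch.zeros(num_paths))

    def forward(self, path_losses):
        weights = torch.softmax(self.path_logits, dim=0)
        total = torch.sum(weights * torch.stack(path_losses))
        return total, weights

# Inside a training step (student/teacher outputs, features, and labels assumed given):
#   losses = [F.cross_entropy(student_logits, labels),
#             soft_target_loss(student_logits, teacher_logits),
#             hint_loss(student_feat, teacher_feat)]
#   total_loss, path_weights = aggregator(losses)
#   total_loss.backward()
```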
Related papers
- AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting [5.818420448447701]
We propose Adaptive Knowledge Distillation, a novel technique inspired by curriculum learning that adaptively weights the losses at the instance level.
Our method follows a plug-and-play paradigm and can be applied on top of any task-specific and distillation objectives.
arXiv Detail & Related papers (2024-05-11T15:06:24Z)
- AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation [12.92102548320001]
This paper proposes a novel knowledge distillation method called Inter-Class Similarity Distillation (ICSD).
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z)
- The Staged Knowledge Distillation in Video Classification: Harmonizing Student Progress by a Complementary Weakly Supervised Framework [21.494759678807686]
We propose a new weakly supervised learning framework for knowledge distillation in video classification.
Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages.
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
arXiv Detail & Related papers (2023-07-11T12:10:42Z)
- Self-Knowledge Distillation via Dropout [0.7883397954991659]
We propose a simple and effective self-knowledge distillation method using dropout (SD-Dropout).
Our method does not require any additional trainable modules, does not rely on additional data, and requires only simple operations.
arXiv Detail & Related papers (2022-08-11T05:08:55Z)
- A Closer Look at Knowledge Distillation with Features, Logits, and Gradients [81.39206923719455]
Knowledge distillation (KD) is a substantial strategy for transferring learned knowledge from one neural network model to another.
This work provides a new perspective to motivate a set of knowledge distillation strategies by approximating the classical KL-divergence criteria with different knowledge sources.
Our analysis indicates that logits are generally a more efficient knowledge source and suggests that having sufficient feature dimensions is crucial for the model design.
arXiv Detail & Related papers (2022-03-18T21:26:55Z)
- Distilling Image Classifiers in Object Detectors [81.63849985128527]
We study the case of object detection and, instead of following the standard detector-to-detector distillation approach, introduce a classifier-to-detector knowledge transfer framework.
In particular, we propose strategies to exploit the classification teacher to improve both the detector's recognition accuracy and localization performance.
arXiv Detail & Related papers (2021-06-09T16:50:10Z)
- Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
arXiv Detail & Related papers (2020-12-05T00:49:14Z)
- There and Back Again: Revisiting Backpropagation Saliency Methods [87.40330595283969]
Saliency methods seek to explain the predictions of a model by producing an importance map across each input sample.
A popular class of such methods is based on backpropagating a signal and analyzing the resulting gradient.
We propose a single framework under which several such methods can be unified.
arXiv Detail & Related papers (2020-04-06T17:58:08Z)
- Disentangling Adaptive Gradient Methods from Learning Rates [65.0397050979662]
We take a deeper look at how adaptive gradient methods interact with the learning rate schedule.
We introduce a "grafting" experiment which decouples an update's magnitude from its direction.
We present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods.
arXiv Detail & Related papers (2020-02-26T21:42:49Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
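Read literally, the RKD summary just above suggests a simple training signal: the student mimics the teacher's feature maps while the assistant learns the residual error between them. A minimal sketch under that reading follows; all names are ours, and the paper's actual losses and architecture may differ.

```python
# Illustrative reading of the RKD summary above, not the paper's exact method:
# the student S mimics the teacher T's feature maps, and the assistant A is
# trained to predict the residual error (T's features minus S's features).
import torch
import torch.nn.functional as F

def rkd_style_step(student, assistant, teacher, x, optimizer):
    """One training step; `optimizer` covers both student and assistant parameters."""
    with torch.no_grad():
        f_t = teacher(x)                 # teacher features (teacher stays frozen)
    f_s = student(x)                     # student features
    residual_pred = assistant(f_s)       # assistant's estimate of what S misses

    mimic_loss = F.mse_loss(f_s, f_t)                                # S mimics T
    residual_loss = F.mse_loss(residual_pred, (f_t - f_s).detach())  # A learns the residual

    loss = mimic_loss + residual_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```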