Differentiable Feature Aggregation Search for Knowledge Distillation
- URL: http://arxiv.org/abs/2008.00506v1
- Date: Sun, 2 Aug 2020 15:42:29 GMT
- Title: Differentiable Feature Aggregation Search for Knowledge Distillation
- Authors: Yushuo Guan, Pengyu Zhao, Bingxuan Wang, Yuanxing Zhang, Cong Yao,
Kaigui Bian, Jian Tang
- Abstract summary: We introduce feature aggregation to imitate multi-teacher distillation within a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
- Score: 47.94874193183427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation has become increasingly important in model
compression. It boosts the performance of a miniaturized student network with
the supervision of the output distribution and feature maps from a
sophisticated teacher network. Some recent works introduce multi-teacher
distillation to provide more supervision to the student network. However, the
effectiveness of multi-teacher distillation methods is accompanied by costly
computation resources. To address both the efficiency and the effectiveness
of knowledge distillation, we introduce feature aggregation to imitate
multi-teacher distillation within the single-teacher distillation framework by
extracting informative supervision from multiple teacher feature maps.
Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation
search method motivated by DARTS in neural architecture search, to
efficiently find the aggregations. In the first stage, DFA formulates the
search problem as a bi-level optimization and leverages a novel bridge loss,
which consists of a student-to-teacher path and a teacher-to-student path, to
find appropriate feature aggregations. The two paths act as two players against
each other, trying to optimize the unified architecture parameters in
opposite directions while guaranteeing both expressivity and learnability of
the feature aggregation simultaneously. In the second stage, DFA performs
knowledge distillation with the derived feature aggregation. Experimental
results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10
datasets under various teacher-student settings, verifying the effectiveness
and robustness of the design.
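To make the two-stage procedure described in the abstract concrete, here is a minimal PyTorch-style sketch. It is an illustration under stated assumptions, not the authors' implementation: the aggregation is taken to be a softmax-weighted sum of 1x1-projected teacher feature maps, the bridge loss is approximated by two L2 terms (one per path), and the bi-level search is reduced to alternating updates. In the paper the two paths share the architecture parameters and drive them in opposing directions; the sketch separates the updates with detach() purely for readability. All names (FeatureAggregator, search_step, get_student_feat, get_teacher_feats) are hypothetical.

```python
# Minimal sketch of a DARTS-style feature-aggregation search for KD.
# Assumptions (not from the paper): softmax-weighted aggregation,
# L2 "bridge" terms, and simple alternating bi-level updates.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAggregator(nn.Module):
    """Softmax-weighted combination of several teacher feature maps."""

    def __init__(self, teacher_channels, student_channels):
        super().__init__()
        # Architecture parameters searched in the first stage.
        self.alpha = nn.Parameter(torch.zeros(len(teacher_channels)))
        # 1x1 convolutions project each teacher map to the student width.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, student_channels, kernel_size=1) for c in teacher_channels]
        )

    def forward(self, teacher_feats, target_hw):
        weights = F.softmax(self.alpha, dim=0)
        agg = 0.0
        for w, proj, feat in zip(weights, self.proj, teacher_feats):
            feat = F.adaptive_avg_pool2d(proj(feat), target_hw)  # match spatial size
            agg = agg + w * feat
        return agg


def search_step(get_student_feat, get_teacher_feats, aggregator,
                x, opt_student, opt_alpha):
    """One alternating (bi-level) update of the first-stage search.

    get_student_feat / get_teacher_feats are assumed hooks returning
    intermediate feature maps; opt_student updates the student weights,
    opt_alpha updates the architecture parameters and projections.
    """
    with torch.no_grad():
        teacher_feats = get_teacher_feats(x)          # teacher stays frozen

    # Teacher-to-student path: adapt the aggregation toward the student,
    # keeping the aggregated supervision learnable.
    student_feat = get_student_feat(x).detach()
    agg = aggregator(teacher_feats, student_feat.shape[-2:])
    opt_alpha.zero_grad()
    F.mse_loss(agg, student_feat).backward()
    opt_alpha.step()

    # Student-to-teacher path: pull the student toward the aggregation,
    # so the expressive teacher content actually reaches the student.
    student_feat = get_student_feat(x)
    agg = aggregator(teacher_feats, student_feat.shape[-2:]).detach()
    opt_student.zero_grad()
    F.mse_loss(student_feat, agg).backward()
    opt_student.step()
```

In the second stage, the searched aggregation would be frozen and used as the distillation target alongside the usual task and KD losses; that stage is omitted from the sketch.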
Related papers
- LAKD-Activation Mapping Distillation Based on Local Learning [12.230042188890838]
This paper proposes a novel knowledge distillation framework, Local Attention Knowledge Distillation (LAKD)
LAKD more efficiently utilizes the distilled information from teacher networks, achieving higher interpretability and competitive performance.
We conducted experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets, and the results show that our LAKD method significantly outperforms existing methods.
arXiv Detail & Related papers (2024-08-21T09:43:27Z)
- I2CKD: Intra- and Inter-Class Knowledge Distillation for Semantic Segmentation [1.433758865948252]
This paper proposes a new knowledge distillation method tailored for image semantic segmentation, termed Intra- and Inter-Class Knowledge Distillation (I2CKD)
The focus of this method is on capturing and transferring knowledge between the intermediate layers of teacher (cumbersome model) and student (compact model)
arXiv Detail & Related papers (2024-03-27T12:05:22Z)
- One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD)
We argue that the essence of these methods is to discard noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- DisWOT: Student Architecture Search for Distillation WithOut Training [0.0]
We explore a novel training-free framework to search for the best student architectures for a given teacher.
Our work first empirically shows that the optimal model under vanilla training cannot be the winner in distillation.
Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces.
arXiv Detail & Related papers (2023-03-28T01:58:45Z)
- Teaching What You Should Teach: A Data-Based Distillation Method [20.595460553747163]
We introduce the "Teaching what you Should Teach" strategy into a knowledge distillation framework.
We propose a data-based distillation method named "TST" that searches for desirable augmented samples to assist in distilling more efficiently and rationally.
To be specific, we design a neural network-based data augmentation module with a priori bias, which assists in finding samples that play to the teacher's strengths while exposing the student's weaknesses.
arXiv Detail & Related papers (2022-12-11T06:22:14Z)
- Channel Self-Supervision for Online Knowledge Distillation [14.033675223173933]
We propose a novel online knowledge distillation method, Channel Self-Supervision for Online Knowledge Distillation (CSS)
We construct a dual-network multi-branch structure and enhance inter-branch diversity through self-supervised learning.
Our method provides greater diversity than OKDDip and also delivers considerable performance improvement, even over state-of-the-art methods such as PCL.
arXiv Detail & Related papers (2022-03-22T12:35:20Z)
- Exploring Inter-Channel Correlation for Diversity-preserved Knowledge Distillation [91.56643684860062]
Inter-Channel Correlation for Knowledge Distillation (ICKD) is developed.
ICKD captures the intrinsic distribution of the feature space and sufficient diversity properties of features in the teacher network.
Ours is the first knowledge-distillation-based method to boost ResNet18 beyond 72% Top-1 accuracy on ImageNet classification.
arXiv Detail & Related papers (2022-02-08T07:01:56Z)
- Distilling a Powerful Student Model via Online Knowledge Distillation [158.68873654990895]
Existing online knowledge distillation approaches either adopt the student with the best performance or construct an ensemble model for better holistic performance.
We propose a novel method for online knowledge distillation, termed FFSD, which comprises two key components: Feature Fusion and Self-Distillation.
arXiv Detail & Related papers (2021-03-26T13:54:24Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
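The WCoRD entry above leans on the standard fact that an InfoNCE-style contrastive objective lower-bounds the mutual information between teacher and student representations. The sketch below shows only that generic contrastive distillation term as an illustration; it is not WCoRD's primal/dual Wasserstein construction, and the batch size, embedding width, and temperature are arbitrary assumptions.

```python
# Generic InfoNCE-style contrastive distillation term (illustrative only;
# not WCoRD's Wasserstein-based formulation).
import torch
import torch.nn.functional as F


def contrastive_kd_loss(student_emb, teacher_emb, temperature=0.1):
    """Matching (student_i, teacher_i) pairs are positives; every other
    pairing in the batch is a negative. Minimizing this cross-entropy
    maximizes an InfoNCE lower bound on teacher-student mutual information."""
    s = F.normalize(student_emb, dim=1)      # (B, D)
    t = F.normalize(teacher_emb, dim=1)      # (B, D)
    logits = s @ t.t() / temperature         # (B, B) cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)


# Usage with random embeddings standing in for pooled, projected features:
loss = contrastive_kd_loss(torch.randn(32, 128), torch.randn(32, 128))
```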