Show, Attend and Distill: Knowledge Distillation via Attention-based
Feature Matching
- URL: http://arxiv.org/abs/2102.02973v1
- Date: Fri, 5 Feb 2021 03:07:57 GMT
- Title: Show, Attend and Distill: Knowledge Distillation via Attention-based
Feature Matching
- Authors: Mingi Ji, Byeongho Heo, Sungrae Park
- Abstract summary: Most studies manually tie intermediate features of the teacher and student, and transfer knowledge through pre-defined links.
We introduce an effective and efficient feature distillation method utilizing all the feature levels of the teacher without manually selecting the links.
- Score: 14.666392130118307
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Knowledge distillation extracts general knowledge from a pre-trained teacher
network and provides guidance to a target student network. Most studies
manually tie intermediate features of the teacher and student, and transfer
knowledge through pre-defined links. However, manual selection often constructs
ineffective links that limit the improvement from the distillation. There has
been an attempt to address the problem, but it is still challenging to identify
effective links under practical scenarios. In this paper, we introduce an
effective and efficient feature distillation method utilizing all the feature
levels of the teacher without manually selecting the links. Specifically, our
method utilizes an attention-based meta-network that learns relative
similarities between features, and applies identified similarities to control
distillation intensities of all possible pairs. As a result, our method
determines competent links more efficiently than the previous approach and
provides better performance on model compression and transfer learning tasks.
Further qualitative analyses and ablative studies describe how our method
contributes to better distillation. The implementation code is available at
github.com/clovaai/attention-feature-distillation.
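The abstract's core idea, weighting every teacher-student feature pair by a similarity score instead of hand-picking links, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes every feature has already been pooled to a vector of the same length, it uses a raw scaled dot product where the paper describes a learned attention-based meta-network, and the choice to normalize the weights over student levels for each teacher level is an assumption. The function name `attention_weighted_distill_loss` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_weighted_distill_loss(teacher_feats, student_feats):
    """Sum squared-error losses over ALL teacher-student pairs,
    each scaled by an attention weight.

    teacher_feats, student_feats: lists of equal-length float vectors
    (assumption: features are already pooled/projected to one size).
    Returns (loss, weights), where weights[t][s] is the intensity
    assigned to the pair (teacher level t, student level s).
    """
    d = len(teacher_feats[0])
    loss = 0.0
    weights = []
    for t in teacher_feats:
        # Similarity of this teacher level to every student level
        # (assumption: an unlearned scaled dot product).
        scores = [dot(t, s) / math.sqrt(d) for s in student_feats]
        alpha = softmax(scores)  # relative similarities, sum to 1
        weights.append(alpha)
        for a, s in zip(alpha, student_feats):
            mse = sum((x - y) ** 2 for x, y in zip(t, s)) / d
            loss += a * mse  # similar pairs are distilled more strongly
    return loss, weights


teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0]]
loss, weights = attention_weighted_distill_loss(teacher, student)
```

In this toy input each teacher level puts more weight on the student level it resembles, so no link has to be chosen by hand; in a trained system the similarity function would itself be learned.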
Related papers
- Distillation versus Contrastive Learning: How to Train Your Rerankers [37.43565487845178]
Two strategies are widely used to train text rerankers: contrastive learning and knowledge distillation.
This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data.
Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model.
arXiv Detail & Related papers (2025-07-11T06:28:35Z)
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- The Staged Knowledge Distillation in Video Classification: Harmonizing
Student Progress by a Complementary Weakly Supervised Framework [21.494759678807686]
We propose a new weakly supervised learning framework for knowledge distillation in video classification.
Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages.
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
arXiv Detail & Related papers (2023-07-11T12:10:42Z)
- Normalized Feature Distillation for Semantic Segmentation [6.882655287146012]
We propose a simple yet effective feature distillation method called normalized feature distillation (NFD).
Our method achieves state-of-the-art distillation results for semantic segmentation on Cityscapes, VOC 2012, and ADE20K datasets.
arXiv Detail & Related papers (2022-07-12T01:54:25Z)
- Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z)
- Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the role of connection paths across levels between teacher and student networks, and reveal its great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our final nested and compact framework requires negligible overhead, and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z)
- Students are the Best Teacher: Exit-Ensemble Distillation with
Multi-Exits [25.140055086630838]
This paper proposes a novel knowledge distillation-based learning method to improve the classification performance of convolutional neural networks (CNNs).
Unlike the conventional notion of distillation where teachers only teach students, we show that students can also help other students and even the teacher to learn better.
arXiv Detail & Related papers (2021-04-01T07:10:36Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn knowledge from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
arXiv Detail & Related papers (2021-01-21T07:17:04Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)