Class-aware Information for Logit-based Knowledge Distillation
- URL: http://arxiv.org/abs/2211.14773v1
- Date: Sun, 27 Nov 2022 09:27:50 GMT
- Title: Class-aware Information for Logit-based Knowledge Distillation
- Authors: Shuoxi Zhang, Hanpeng Liu, John E. Hopcroft, Kun He
- Abstract summary: We propose a Class-aware Logit Knowledge Distillation (CLKD) method that extends logit distillation to both the instance level and the class level.
CLKD enables the student model to mimic higher-level semantic information from the teacher model, thereby improving distillation performance.
- Score: 16.634819319915923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation aims to transfer knowledge to the student model by
utilizing the predictions/features of the teacher model, and feature-based
distillation has recently shown its superiority over logit-based distillation.
However, due to the cumbersome computation and storage of extra feature
transformation, the training overhead of feature-based methods is much higher
than that of logit-based distillation. In this work, we revisit the logit-based
knowledge distillation and observe that existing logit-based distillation methods treat the prediction logits only at the instance level, overlooking much other useful semantic information. To address this issue, we propose a Class-aware Logit Knowledge Distillation (CLKD) method that extends logit distillation to both the instance level and the class level. CLKD enables the student model to mimic higher-level semantic information from the teacher model, thereby improving distillation performance. We further introduce a novel loss, called Class Correlation Loss, to force the student to learn the inherent class-level correlations of the teacher. Empirical comparisons demonstrate the superiority of the proposed method over several prevailing logit-based and feature-based methods: CLKD achieves compelling results on various visual classification tasks and outperforms the state-of-the-art baselines.
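As a rough illustration of the idea above, the sketch below combines a standard instance-level distillation term with a class-level term computed along the batch dimension and a simple class-correlation matching loss. The abstract does not give the exact formulation, so the temperatures, the batch-wise softmax, and the correlation definition used here are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def clkd_style_loss(student_logits, teacher_logits, T=4.0,
                    w_instance=1.0, w_class=1.0, w_corr=1.0):
    """Hypothetical sketch of instance-level + class-level logit distillation.

    student_logits, teacher_logits: (batch_size, num_classes) tensors.
    The class-level term and the correlation loss below are assumptions
    based on the abstract, not the paper's exact definitions.
    """
    # Instance-level KD: softmax over classes for each sample (standard KD).
    inst = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Class-level KD (assumed): softmax over the batch for each class,
    # so the student also matches how each class is distributed across samples.
    cls = F.kl_div(
        F.log_softmax(student_logits / T, dim=0),
        F.softmax(teacher_logits / T, dim=0),
        reduction="batchmean",
    ) * (T * T)

    # Class Correlation Loss (assumed): match class-by-class correlation
    # matrices computed from the logits of the two models.
    def class_corr(logits):
        z = logits - logits.mean(dim=0, keepdim=True)
        z = z / (z.norm(dim=0, keepdim=True) + 1e-8)
        return z.t() @ z  # (num_classes, num_classes)

    corr = F.mse_loss(class_corr(student_logits), class_corr(teacher_logits))

    return w_instance * inst + w_class * cls + w_corr * corr
```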
Related papers
- Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed adaptive explicit knowledge transfer (AEKT) method achieves improved performance compared to the state-of-the-art KD methods.
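The core idea summarized above is to transfer the teacher's distribution over the non-target classes. A minimal, generic sketch of that idea (not AEKT's adaptive formulation, which is not described here) masks out the ground-truth class and matches the renormalized distributions:

```python
import torch
import torch.nn.functional as F

def non_target_kd_loss(student_logits, teacher_logits, targets, T=4.0):
    """Illustrative non-target class distillation.

    student_logits, teacher_logits: (batch_size, num_classes) tensors.
    targets: integer class labels of shape (batch_size,).
    A generic sketch of the idea, not AEKT's adaptive weighting.
    """
    B, C = student_logits.shape
    keep = ~F.one_hot(targets, C).bool()        # non-target classes per sample

    # Drop the target-class logit for each row, then renormalize the rest.
    s = student_logits[keep].view(B, C - 1)
    t = teacher_logits[keep].view(B, C - 1)

    return F.kl_div(
        F.log_softmax(s / T, dim=1),
        F.softmax(t / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
```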
arXiv Detail & Related papers (2024-09-03T07:42:59Z)
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
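The token-level objective is described only as a "distribution adaptive clipping Kullback-Leibler loss". One plausible reading, shown as a hypothetical sketch below, clips the per-token log-ratio before accumulating the KL term so that tokens where teacher and student disagree wildly do not dominate the gradient; the clipping rule here is an assumption, not the paper's definition.

```python
import torch

def clipped_token_kl(student_log_probs, teacher_probs, clip=5.0):
    """Hypothetical clipped KL over token distributions.

    student_log_probs: (seq_len, vocab_size) log-probabilities from the student.
    teacher_probs:     (seq_len, vocab_size) probabilities from the teacher.
    `clip` bounds the per-token log-ratio; this rule is an illustrative
    assumption, not the paper's distribution-adaptive formulation.
    """
    log_ratio = teacher_probs.clamp_min(1e-8).log() - student_log_probs
    log_ratio = log_ratio.clamp(-clip, clip)      # limit extreme disagreements
    kl = (teacher_probs * log_ratio).sum(dim=-1)  # per-token KL(teacher || student)
    return kl.mean()
```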
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- DistillCSE: Distilled Contrastive Learning for Sentence Embeddings [32.6620719893457]
This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation.
The potential advantage of DistillCSE is its self-enhancing feature: using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation.
The paper proposes two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy as an implicit regularization, and averaging logits from multiple teacher components.
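Of the two solutions above, logit averaging is the more mechanical one. The hypothetical sketch below simply averages logits from several teacher components before distilling into the student; the Group-P shuffling regularizer is omitted, and the contrastive similarity setup is assumed.

```python
import torch
import torch.nn.functional as F

def averaged_teacher_distill_loss(student_logits, teacher_logits_list, T=2.0):
    """Average logits from multiple teacher components, then distill.

    student_logits:      (batch, batch) similarity logits from the student.
    teacher_logits_list: list of (batch, batch) logits, one per teacher component.
    A generic sketch of teacher-logit averaging; DistillCSE's exact setup
    (contrastive similarity matrices, temperatures) is assumed here.
    """
    teacher_logits = torch.stack(teacher_logits_list, dim=0).mean(dim=0)
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
```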
arXiv Detail & Related papers (2023-10-20T13:45:59Z)
- The Staged Knowledge Distillation in Video Classification: Harmonizing Student Progress by a Complementary Weakly Supervised Framework [21.494759678807686]
We propose a new weakly supervised learning framework for knowledge distillation in video classification.
Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages.
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
arXiv Detail & Related papers (2023-07-11T12:10:42Z)
- Knowledge Distillation from Single to Multi Labels: an Empirical Study [14.12487391004319]
We introduce a novel distillation method based on Class Activation Maps (CAMs).
Our findings indicate that the logit-based method is not well-suited for multi-label classification.
We propose that a suitable dark knowledge should incorporate class-wise information and be highly correlated with the final classification results.
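The summary does not spell out the CAM-based loss, but a generic CAM-matching sketch, under the assumption of an L2 loss over normalized per-class activation maps, looks as follows:

```python
import torch
import torch.nn.functional as F

def cam_distill_loss(student_feats, teacher_feats, student_fc_w, teacher_fc_w):
    """Generic CAM-matching sketch.

    *_feats: (batch, channels, H, W) feature maps before global pooling.
    *_fc_w:  (num_classes, channels) weights of the final linear classifier.
    The paper's exact CAM-based loss is not given in the summary; this
    simply matches per-class activation maps with an L2 loss.
    """
    def cams(feats, fc_w):
        maps = torch.einsum("kc,bchw->bkhw", fc_w, feats)  # per-class maps
        b, k, h, w = maps.shape
        return F.normalize(maps.reshape(b, k, h * w), dim=-1)

    # Teacher and student spatial sizes are assumed equal here; a real
    # implementation would resize one set of maps to match the other.
    return F.mse_loss(cams(student_feats, student_fc_w),
                      cams(teacher_feats, teacher_fc_w))
```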
arXiv Detail & Related papers (2023-03-15T04:39:01Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z)
- Localization Distillation for Object Detection [134.12664548771534]
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the classification logits.
We present a novel localization distillation (LD) method which can efficiently transfer the localization knowledge from the teacher to the student.
We show that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a key reason why logit mimicking has underperformed for years.
arXiv Detail & Related papers (2022-04-12T17:14:34Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant model (A).
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
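Based only on the summary above, the residual scheme can be sketched as follows: the student S regresses the teacher T's feature maps while an assistant A is trained on the residual between them. The L2 losses and equal weighting are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualKDSketch(nn.Module):
    """Illustrative sketch of residual knowledge distillation.

    `student` and `assistant` are trainable networks producing feature maps
    with the same shape as the (frozen) teacher features. The loss choices
    and weighting are assumptions, not the paper's exact configuration.
    """

    def __init__(self, student: nn.Module, assistant: nn.Module):
        super().__init__()
        self.student = student
        self.assistant = assistant

    def forward(self, x, teacher_feat):
        student_feat = self.student(x)
        # S mimics T's feature maps directly.
        mimic_loss = F.mse_loss(student_feat, teacher_feat)
        # A learns the residual error between teacher and student features.
        residual = (teacher_feat - student_feat).detach()
        assist_loss = F.mse_loss(self.assistant(x), residual)
        return mimic_loss + assist_loss
```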
arXiv Detail & Related papers (2020-02-21T07:49:26Z)