Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective
- URL: http://arxiv.org/abs/2512.04625v1
- Date: Thu, 04 Dec 2025 09:56:25 GMT
- Title: Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective
- Authors: Bowen Zheng, Ran Cheng
- Abstract summary: Decoupled Knowledge Distillation (DKD) re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. We introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss. We demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods.
- Score: 9.10299144143817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the history of knowledge distillation, the focus shifted over time from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. In response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. We then pay particular attention to the teacher model's predictive distribution and its impact on the gradients of the GDKD loss, uncovering two critical insights that are often overlooked: (1) partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Building on these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models' predictive distributions. Our comprehensive experiments on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.
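To make the decomposition concrete, below is a minimal PyTorch sketch of a DKD-style loss that partitions classes by the teacher's top logit (insight 1) and uses a separate weight to amplify the non-top term (insight 2). The function name, the alpha/beta/T values, and the large-negative-logit masking trick are illustrative assumptions rather than the paper's reference implementation; consult the linked repository for the official GDKD code.

```python
import torch
import torch.nn.functional as F

def gdkd_style_loss(logits_s, logits_t, alpha=1.0, beta=8.0, T=4.0):
    """Hedged sketch of a decoupled KD loss partitioned by the
    teacher's top logit. alpha/beta/T are illustrative values,
    not the paper's tuned hyperparameters."""
    # Boolean mask marking the teacher's top class per sample.
    top = logits_t.argmax(dim=1, keepdim=True)                    # (B, 1)
    top_mask = torch.zeros_like(logits_t, dtype=torch.bool)
    top_mask.scatter_(1, top, True)

    p_s = F.softmax(logits_s / T, dim=1)
    p_t = F.softmax(logits_t / T, dim=1)

    # (1) Binary "top vs. rest" distributions -> top-class term.
    b_s = torch.stack([(p_s * top_mask).sum(1), (p_s * ~top_mask).sum(1)], 1)
    b_t = torch.stack([(p_t * top_mask).sum(1), (p_t * ~top_mask).sum(1)], 1)
    top_term = F.kl_div(b_s.clamp_min(1e-8).log(), b_t, reduction="batchmean")

    # (2) Distribution over the non-top classes only -> non-top term.
    # A large negative logit pushes the top class to ~zero probability,
    # renormalizing the softmax over the remaining classes.
    s_rest = F.log_softmax(logits_s.masked_fill(top_mask, -1e9) / T, dim=1)
    t_rest = F.softmax(logits_t.masked_fill(top_mask, -1e9) / T, dim=1)
    rest_term = F.kl_div(s_rest, t_rest, reduction="batchmean")

    # Choosing beta > alpha amplifies the non-top loss, shifting
    # gradient mass onto the non-top logits (insight 2).
    return (alpha * top_term + beta * rest_term) * T * T
```

In a training loop, this term would typically be added to the usual cross-entropy on the student's logits; setting beta noticeably larger than alpha is what strengthens knowledge transfer among the non-top logits, the behavior the abstract's second insight highlights.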
Related papers
- Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation [73.32435804067883]
Generalizable Knowledge Distillation (GKD) is a multi-stage framework that explicitly enhances generalization. Experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods.
arXiv Detail & Related papers (2026-03-03T03:18:12Z) - Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation [13.970649308150895]
Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. We propose a feature KD framework for training the student's backbone using feature-based losses exclusively. We demonstrate our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches.
arXiv Detail & Related papers (2025-11-18T23:50:31Z) - TopKD: Top-scaled Knowledge Distillation [12.589031892370809]
Top-scaled Knowledge Distillation (TopKD) is a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. Experiments on CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet demonstrate that TopKD consistently surpasses state-of-the-art distillation methods.
arXiv Detail & Related papers (2025-08-06T15:26:08Z) - On Distilling the Displacement Knowledge for Few-Shot Class-Incremental Learning [17.819582979803286]
Few-shot Class-Incremental Learning (FSCIL) addresses the challenges of evolving data distributions and the difficulty of data acquisition in real-world scenarios. To counteract the catastrophic forgetting typically encountered in FSCIL, knowledge distillation is employed as a way to maintain the knowledge from the learned data distribution.
arXiv Detail & Related papers (2024-12-15T02:10:18Z) - Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z) - Grouped Knowledge Distillation for Deep Face Recognition [53.57402723008569]
The light-weight student network has difficulty fitting the target logits due to its low model capacity.
We propose a Grouped Knowledge Distillation (GKD) that retains the Primary-KD and Binary-KD but omits Secondary-KD in the ultimate KD loss calculation.
arXiv Detail & Related papers (2023-04-10T09:04:38Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A).
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them (see the sketch after this entry).
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
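Taking the RKD summary above literally, one simplified reading is sketched below: the student S mimics the teacher T's feature maps directly, while an assistant module A is trained to predict the residual error between them, so that f_S + A(f_S) approaches f_T. The 1x1-convolution assistant, the assumed matching feature shapes, and the equal weighting of the two terms are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Assistant(nn.Module):
    """Hypothetical light-weight assistant: predicts the residual
    between student and teacher feature maps (assumed same shape)."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        return self.proj(f_s)

def rkd_style_loss(f_s: torch.Tensor, f_t: torch.Tensor, assistant: nn.Module):
    # The student mimics the teacher's feature maps directly ...
    mimic = F.mse_loss(f_s, f_t.detach())
    # ... while the assistant learns the residual error between them.
    residual = F.mse_loss(assistant(f_s.detach()), (f_t - f_s).detach())
    return mimic + residual
```

In this reading, A exists only to shape the training signal and would be discarded at inference, leaving the student alone.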