Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head
- URL: http://arxiv.org/abs/2411.08937v2
- Date: Wed, 28 May 2025 07:47:50 GMT
- Title: Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head
- Authors: Penghui Yang, Chen-Chen Zong, Sheng-Jun Huang, Lei Feng, Bo An
- Abstract summary: We introduce a logit-level loss function as a supplement to the widely used probability-level loss function. We find that combining the newly introduced logit-level loss with the previous probability-level loss leads to performance degeneration. We therefore propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses.
- Score: 38.898038672237746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional knowledge distillation focuses on aligning the student's predicted probabilities with both ground-truth labels and the teacher's predicted probabilities. However, the transition to predicted probabilities from logits would obscure certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, for exploiting the latent information of logits. Unfortunately, we empirically find that the amalgamation of the newly introduced logit-level loss and the previous probability-level loss will lead to performance degeneration, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on the neural collapse theory. Specifically, the gradients of the two loss functions exhibit contradictions in the linear classifier yet display no such conflict within the backbone. Drawing from the theoretical analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts. Our code is available at: https://github.com/penghui-yang/DHKD.
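The dual-head idea from the abstract can be illustrated with a minimal NumPy sketch: a shared backbone feeds two separate linear heads, one trained with the probability-level KL loss and one (the auxiliary head) with a logit-level loss, so the conflicting gradients never meet in a single classifier. This is an illustrative sketch only, not the authors' implementation; the names (`W_kd`, `W_aux`) and the choice of MSE as the logit-level loss are assumptions for demonstration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # probability-level loss: KL(teacher || student) over class distributions
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()

def mse(a, b):
    # logit-level loss: match raw logits directly
    return np.mean((a - b) ** 2)

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8))       # shared backbone features (batch=4, dim=8)
W_kd = rng.normal(size=(8, 10))      # head receiving the probability-level loss
W_aux = rng.normal(size=(8, 10))     # auxiliary head receiving the logit-level loss
t_logits = rng.normal(size=(4, 10))  # teacher logits

logits_kd = feat @ W_kd              # each head sees only its own loss ...
logits_aux = feat @ W_aux
loss_prob = kl_div(softmax(t_logits), softmax(logits_kd))
loss_logit = mse(logits_aux, t_logits)
total = loss_prob + loss_logit       # ... while the backbone is shaped by both
```

At inference time only the probability-level head would be kept; the auxiliary head exists solely to route the logit-level gradient into the backbone.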
Related papers
- How does Labeling Error Impact Contrastive Learning? A Perspective from Data Dimensionality Reduction [29.43826752911795]
This paper investigates the theoretical impact of labeling error on the downstream classification performance of contrastive learning. To mitigate these impacts, a data dimensionality reduction method (e.g., singular value decomposition) is applied to the original data to reduce false positive samples. It is also found that SVD acts as a double-edged sword: it may degrade downstream classification accuracy due to the reduced connectivity of the augmentation graph.
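The SVD-based dimensionality reduction mentioned above amounts to a truncated rank-k reconstruction of the data matrix. A minimal NumPy sketch (the data and rank here are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))          # original data matrix (samples x features)

# full SVD, then keep only the top-k singular directions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k]    # rank-k approximation of X
```

By the Eckart-Young theorem, the Frobenius-norm error of this reconstruction equals the root sum of squares of the discarded singular values, which is why truncation suppresses low-energy (potentially noisy) directions.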
arXiv Detail & Related papers (2025-07-15T10:09:55Z) - Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law [49.25050966412749]
We introduce the Capability Salience Vector, which decomposes the overall loss and assigns different importance weights to tokens to assess a specific meta-capability. Experiments on various popular benchmarks demonstrate that our proposed Capability Salience Vector can significantly improve the predictability of language model performance on downstream tasks.
arXiv Detail & Related papers (2025-06-16T08:16:03Z) - Generalized Kullback-Leibler Divergence Loss [105.66549870868971]
We prove that the Kullback-Leibler (KL) divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) divergence loss. Thanks to the decoupled structure of the DKL loss, we identify two areas for improvement.
arXiv Detail & Related papers (2025-03-11T04:43:33Z) - LEARN: An Invex Loss for Outlier Oblivious Robust Online Optimization [56.67706781191521]
We present a robust online optimization framework in which an adversary can introduce outliers by corrupting the loss functions in an arbitrary number k of rounds, with k unknown to the learner.
arXiv Detail & Related papers (2024-08-12T17:08:31Z) - Robust Loss Functions for Training Decision Trees with Noisy Labels [4.795403008763752]
We consider training decision trees using noisily labeled data, focusing on loss functions that can lead to robust learning algorithms.
First, we offer novel theoretical insights on the robustness of many existing loss functions in the context of decision tree learning.
Second, we introduce a framework for constructing robust loss functions, called distribution losses.
arXiv Detail & Related papers (2023-12-20T11:27:46Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Robust Contrastive Learning With Theory Guarantee [25.57187964518637]
Contrastive learning (CL) is a self-supervised training paradigm that allows us to extract meaningful features without any label information.
Our work develops rigorous theories to dissect and identify which components in the unsupervised loss can help improve the robust supervised loss.
arXiv Detail & Related papers (2023-11-16T08:39:58Z) - Studying the Interplay between Information Loss and Operation Loss in Representations for Classification [15.369895042965261]
Information-theoretic measures have been widely adopted in the design of features for learning and decision problems.
We show that it is possible to adopt an alternative notion of informational sufficiency to achieve operational sufficiency in learning.
arXiv Detail & Related papers (2021-12-30T23:17:05Z) - Understanding Square Loss in Training Overparametrized Neural Network Classifiers [31.319145959402462]
We contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks.
We consider two cases, according to whether the classes are separable or not. In the general non-separable case, fast convergence rates are established for both the misclassification rate and the calibration error.
The resulting margin is proven to be lower bounded away from zero, providing theoretical guarantees for robustness.
arXiv Detail & Related papers (2021-12-07T12:12:30Z) - On Codomain Separability and Label Inference from (Noisy) Loss Functions [11.780563744330038]
We introduce the notion of codomain separability to study the necessary and sufficient conditions under which label inference is possible from any (noisy) loss function values.
We show that for many commonly used loss functions, including multiclass cross-entropy with common activation functions and some Bregman divergence-based losses, it is possible to design label inference attacks for arbitrary noise levels.
arXiv Detail & Related papers (2021-07-07T05:29:53Z) - Leveraged Weighted Loss for Partial Label Learning [64.85763991485652]
Partial label learning deals with data where each instance is assigned with a set of candidate labels, whereas only one of them is true.
Despite many methodological studies on learning from partial labels, there is still a lack of theoretical understanding of their risk-consistent properties.
We propose a family of loss functions named leveraged weighted (LW) loss, which for the first time introduces the leverage parameter $\beta$ to consider the trade-off between losses on partial labels and non-partial ones.
arXiv Detail & Related papers (2021-06-10T13:25:13Z) - Lower-bounded proper losses for weakly supervised classification [73.974163801142]
We discuss the problem of weakly supervised classification, in which instances are given weak labels.
We derive a representation theorem for proper losses in supervised learning, which dualizes the Savage representation.
We experimentally demonstrate the effectiveness of our proposed approach, as compared to improper or unbounded losses.
arXiv Detail & Related papers (2021-03-04T08:47:07Z) - A Symmetric Loss Perspective of Reliable Machine Learning [87.68601212686086]
We review how a symmetric loss can yield robust classification from corrupted labels in balanced error rate (BER) minimization.
We demonstrate how the robust AUC method can benefit natural language processing in problems where we want to learn only from relevant keywords.
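The symmetric-loss condition behind this robustness result requires that l(z) + l(-z) be constant for all margins z. A minimal check for the sigmoid loss, which satisfies this condition exactly (a standalone illustration, not code from the paper):

```python
import math

def sigmoid_loss(z):
    # sigmoid loss: l(z) = 1 / (1 + exp(z)), a classic symmetric loss
    return 1.0 / (1.0 + math.exp(z))

# symmetric condition: l(z) + l(-z) is the same constant for every margin z
for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    assert abs(sigmoid_loss(z) + sigmoid_loss(-z) - 1.0) < 1e-12
```

By contrast, the logistic or hinge losses do not satisfy this identity, which is why BER minimization under label corruption favors symmetric losses.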
arXiv Detail & Related papers (2021-01-05T06:25:47Z) - T-Norms Driven Loss Functions for Machine Learning [19.569025323453257]
A class of neural-symbolic approaches is based on First-Order Logic to represent prior knowledge.
This paper shows that the loss function expressing these neural-symbolic learning tasks can be unambiguously determined.
arXiv Detail & Related papers (2019-07-26T10:22:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.