Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head
- URL: http://arxiv.org/abs/2411.08937v1
- Date: Wed, 13 Nov 2024 12:33:04 GMT
- Title: Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head
- Authors: Penghui Yang, Chen-Chen Zong, Sheng-Jun Huang, Lei Feng, Bo An
- Abstract summary: We introduce a logit-level loss function as a supplement to the widely used probability-level loss function.
We find that the amalgamation of the newly introduced logit-level loss and the previous probability-level loss will lead to performance degeneration.
We propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses.
- Score: 38.898038672237746
- License:
- Abstract: Traditional knowledge distillation focuses on aligning the student's predicted probabilities with both ground-truth labels and the teacher's predicted probabilities. However, the transition from logits to predicted probabilities obscures certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, so as to exploit the latent information in the logits. Unfortunately, we empirically find that combining the newly introduced logit-level loss with the previous probability-level loss leads to performance degeneration, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on neural collapse theory. Specifically, the gradients of the two loss functions exhibit contradictions in the linear classifier yet display no such conflict within the backbone. Drawing on this analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating their adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance compared with state-of-the-art counterparts.
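The following is a minimal PyTorch sketch of the dual-head design described in the abstract: a shared backbone feeds two separate linear classification heads, each receiving gradients from a different loss. The specific losses (cross-entropy plus temperature-scaled KL for the primary head, MSE logit matching for the auxiliary head) and the hyperparameters T, alpha, and beta are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the dual-head idea, assuming a PyTorch setting.
# The routing of losses to heads and the loss choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                             # shared feature extractor
        self.prob_head = nn.Linear(feat_dim, num_classes)    # primary head: CE + probability-level KD
        self.logit_head = nn.Linear(feat_dim, num_classes)   # auxiliary head: logit-level loss

    def forward(self, x):
        feat = self.backbone(x)
        return self.prob_head(feat), self.logit_head(feat)

def dual_head_kd_loss(prob_logits, aux_logits, teacher_logits, labels,
                      T=4.0, alpha=1.0, beta=1.0):
    """Both losses back-propagate into the shared backbone, but each gradient
    reaches only its own classification head."""
    ce = F.cross_entropy(prob_logits, labels)
    # Probability-level loss: KL between temperature-softened distributions (standard KD).
    kd = F.kl_div(F.log_softmax(prob_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    # Logit-level loss on the auxiliary head, e.g. direct logit matching via MSE.
    logit_match = F.mse_loss(aux_logits, teacher_logits)
    return ce + alpha * kd + beta * logit_match
```

In this sketch, only prob_head would be used at inference time, with logit_head serving as a training-time auxiliary; the paper's actual routing of losses and inference protocol may differ.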
Related papers
- LEARN: An Invex Loss for Outlier Oblivious Robust Online Optimization [56.67706781191521]
An adversary can introduce outliers by corrupting the loss functions in an arbitrary number of rounds k, unknown to the learner.
We present a robust online optimization framework for learning in the presence of such outliers.
arXiv Detail & Related papers (2024-08-12T17:08:31Z) - Robust Loss Functions for Training Decision Trees with Noisy Labels [4.795403008763752]
We consider training decision trees using noisily labeled data, focusing on loss functions that can lead to robust learning algorithms.
First, we offer novel theoretical insights on the robustness of many existing loss functions in the context of decision tree learning.
Second, we introduce a framework for constructing robust loss functions, called distribution losses.
arXiv Detail & Related papers (2023-12-20T11:27:46Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Robust Contrastive Learning With Theory Guarantee [25.57187964518637]
Contrastive learning (CL) is a self-supervised training paradigm that allows us to extract meaningful features without any label information.
Our work develops rigorous theories to dissect and identify which components in the unsupervised loss can help improve the robust supervised loss.
arXiv Detail & Related papers (2023-11-16T08:39:58Z) - Studying the Interplay between Information Loss and Operation Loss in Representations for Classification [15.369895042965261]
Information-theoretic measures have been widely adopted in the design of features for learning and decision problems.
We show that it is possible to adopt an alternative notion of informational sufficiency to achieve operational sufficiency in learning.
arXiv Detail & Related papers (2021-12-30T23:17:05Z) - Understanding Square Loss in Training Overparametrized Neural Network
Classifiers [31.319145959402462]
We contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks.
We consider two cases, according to whether classes are separable or not. In the general non-separable case, a fast convergence rate is established for both the misclassification rate and the calibration error.
The resulting margin is proven to be lower bounded away from zero, providing theoretical guarantees for robustness.
arXiv Detail & Related papers (2021-12-07T12:12:30Z) - On Codomain Separability and Label Inference from (Noisy) Loss Functions [11.780563744330038]
We introduce the notion of codomain separability to study the necessary and sufficient conditions under which label inference is possible from any (noisy) loss function values.
We show that for many commonly used loss functions, including multiclass cross-entropy with common activation functions and some Bregman divergence-based losses, it is possible to design label inference attacks for arbitrary noise levels.
arXiv Detail & Related papers (2021-07-07T05:29:53Z) - Leveraged Weighted Loss for Partial Label Learning [64.85763991485652]
Partial label learning deals with data where each instance is assigned a set of candidate labels, of which only one is true.
Despite many methodological studies on learning from partial labels, there is still a lack of theoretical understanding of their risk-consistency properties.
We propose a family of loss functions named leveraged weighted (LW) loss, which for the first time introduces the leverage parameter $\beta$ to consider the trade-off between losses on partial labels and non-partial ones.
arXiv Detail & Related papers (2021-06-10T13:25:13Z) - Lower-bounded proper losses for weakly supervised classification [73.974163801142]
We discuss the problem of weakly supervised classification, in which instances are given weak labels.
We derive a representation theorem for proper losses in supervised learning, which dualizes the Savage representation.
We experimentally demonstrate the effectiveness of our proposed approach, as compared to improper or unbounded losses.
arXiv Detail & Related papers (2021-03-04T08:47:07Z) - A Symmetric Loss Perspective of Reliable Machine Learning [87.68601212686086]
We review how a symmetric loss can yield robust classification from corrupted labels in balanced error rate (BER) minimization.
We demonstrate how the robust AUC method can benefit natural language processing in the problem where we want to learn only from relevant keywords.
arXiv Detail & Related papers (2021-01-05T06:25:47Z)
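As a side note on the last entry above, the key property behind symmetric-loss robustness is that a margin loss ell satisfies ell(z) + ell(-z) = constant. The short NumPy check below illustrates this property with the sigmoid loss (symmetric) versus the logistic loss (not symmetric); the choice of losses is illustrative and not taken from that paper.

```python
# Numerical check of the loss-symmetry property ell(z) + ell(-z) = constant,
# which underlies robustness results for symmetric losses (illustrative
# example, not code from the cited paper).
import numpy as np

def sigmoid_loss(z):
    # Sigmoid loss: symmetric, since sigmoid(z) + sigmoid(-z) = 1 for all z.
    return 1.0 / (1.0 + np.exp(z))

def logistic_loss(z):
    # Logistic loss: not symmetric.
    return np.log1p(np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
print(np.ptp(sigmoid_loss(z) + sigmoid_loss(-z)))    # ~0: the sum is constant (equal to 1)
print(np.ptp(logistic_loss(z) + logistic_loss(-z)))  # clearly > 0: the sum is not constant
```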
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.