Swapped Logit Distillation via Bi-level Teacher Alignment
- URL: http://arxiv.org/abs/2504.20108v1
- Date: Sun, 27 Apr 2025 15:52:07 GMT
- Title: Swapped Logit Distillation via Bi-level Teacher Alignment
- Authors: Stephen Ekaputra Limantoro, Jhe-Hao Lin, Chih-Yu Wang, Yi-Lung Tsai, Hong-Han Shuai, Ching-Chun Huang, Wen-Huang Cheng
- Abstract summary: Knowledge distillation (KD) compresses the network capacity by transferring knowledge from a large (teacher) network to a smaller one (student). We propose a logit-based distillation via swapped logit processing, namely Swapped Logit Distillation (SLD). We find that SLD consistently performs best among previous state-of-the-art methods.
- Score: 32.746586492281104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) compresses network capacity by transferring knowledge from a large (teacher) network to a smaller (student) one. In the mainstream practice, the teacher transfers knowledge to the student directly with its original output distribution, which can contain incorrect predictions. In this article, we propose a logit-based distillation via swapped logit processing, namely Swapped Logit Distillation (SLD). SLD is proposed under two assumptions: (1) a wrong prediction occurs when the confidence of the ground-truth label is not the maximum; (2) the "natural" limit of the probability remains uncertain, as the best value to add to the target cannot be determined. To address these issues, we propose a swapped logit processing scheme. Through this approach, we find that the swap can be effectively extended to both teacher and student outputs, yielding two teachers. We further introduce loss scheduling to boost the alignment of the two teachers. Extensive experiments on image classification tasks demonstrate that SLD consistently outperforms previous state-of-the-art methods.
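As a rough, hedged illustration of the swap idea (not the authors' implementation; the function names, temperature, and fixed weighting below are assumptions, and the paper's loss scheduling is omitted): under assumption (1), one natural reading is to exchange the target-class logit with the current maximum logit whenever the prediction is wrong, and to apply the same processing to both teacher and student outputs so that the two swapped distributions act as two teachers for the student.

```python
import torch
import torch.nn.functional as F

def swap_logits(logits, target):
    """If the target class does not hold the maximum logit, swap the
    target-class logit with the current maximum so the ground-truth class
    becomes the most confident prediction (sketch only)."""
    swapped = logits.clone()
    max_val, max_idx = logits.max(dim=1)
    tgt_val = logits.gather(1, target.unsqueeze(1)).squeeze(1)
    rows = (max_idx != target).nonzero(as_tuple=True)[0]   # mispredicted samples
    swapped[rows, target[rows]] = max_val[rows]
    swapped[rows, max_idx[rows]] = tgt_val[rows]
    return swapped

def sld_like_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.5):
    """Distill the student toward two 'teachers': the swapped teacher output
    and the swapped (detached) student output. alpha is a hypothetical fixed
    weight standing in for the paper's loss scheduling."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t_swap = F.softmax(swap_logits(teacher_logits, target) / T, dim=1)
    p_s_swap = F.softmax(swap_logits(student_logits.detach(), target) / T, dim=1)
    kd_teacher = F.kl_div(log_p_s, p_t_swap, reduction="batchmean") * T * T
    kd_self = F.kl_div(log_p_s, p_s_swap, reduction="batchmean") * T * T
    return alpha * kd_teacher + (1 - alpha) * kd_self
```

In practice this would be combined with the usual cross-entropy term on the ground-truth labels; that term and the scheduling details are left out of the sketch.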
Related papers
- Cross-Tokenizer Distillation via Approximate Likelihood Matching [17.597293085255075]
We develop a cross-tokenizer distillation method to solve this deficiency. Our method is the first to enable cross-tokenizer distillation without a next-token prediction loss. Our results make substantial strides toward better adaptability and enhanced interaction between different LLMs.
arXiv Detail & Related papers (2025-03-25T21:44:10Z)
- Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the student with the teacher in advance of distillation. Experiments on seven benchmarks demonstrate that Warmup-Distill provides a warmed-up student that is more suitable for distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z)
- Self-Evolution Knowledge Distillation for LLM-based Machine Translation [36.01859033056453]
We propose a distillation strategy called Self-Evolution KD. The core of this approach is to dynamically integrate the teacher distribution and the one-hot ground-truth distribution into the student distribution as prior knowledge. Experimental results show our method brings an average improvement of approximately 1.4 SacreBLEU points across four translation directions on the WMT22 test sets.
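A minimal, hypothetical sketch of the blending described above, assuming the teacher distribution and the one-hot ground-truth distribution are mixed into the student's own distribution to form the distillation target; the mixing weight `lam` and the blending rule are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def self_evolution_like_loss(student_logits, teacher_logits, target, lam=0.5, T=1.0):
    """Blend the teacher and one-hot ground-truth distributions into the
    (detached) student distribution as prior knowledge, then distill toward
    the blend. `lam` could be scheduled over training; fixed here for simplicity."""
    p_s = F.softmax(student_logits.detach() / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    one_hot = F.one_hot(target, num_classes=student_logits.size(1)).float()
    prior = 0.5 * (p_t + one_hot)                    # teacher + ground truth
    target_dist = (1 - lam) * p_s + lam * prior      # integrated into the student distribution
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, target_dist, reduction="batchmean") * T * T
```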
arXiv Detail & Related papers (2024-12-19T12:24:15Z)
- Do Not Blindly Imitate the Teacher: Using Perturbed Loss for Knowledge Distillation [37.57793306258625]
Students learn to imitate the teacher by minimizing the KL divergence between their output distribution and the teacher's output distribution.
We argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground truth label distribution.
We propose a novel knowledge distillation objective PTLoss by first representing the vanilla KL-based distillation loss function via a Maclaurin series and then perturbing the leading-order terms in this series.
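As a hedged reading of that construction (notation mine; the paper's exact expansion and coefficients may differ): writing the student probabilities as $p_i^{s} = p_i^{t}(1+\epsilon_i)$, the Maclaurin series of $\log(1+\epsilon_i)$ turns the vanilla KL loss into a power series whose leading-order terms can then be reweighted:

```latex
\mathcal{L}_{\mathrm{KL}}
  = \sum_i p_i^{t}\,\log\frac{p_i^{t}}{p_i^{s}}
  = -\sum_i p_i^{t}\,\log(1+\epsilon_i)
  = -\sum_i p_i^{t}\Big(\epsilon_i - \tfrac{\epsilon_i^{2}}{2} + \tfrac{\epsilon_i^{3}}{3} - \cdots\Big),
\qquad
\mathcal{L}_{\mathrm{PT}}
  = -\sum_i p_i^{t}\Big(\alpha_1\,\epsilon_i - \alpha_2\,\tfrac{\epsilon_i^{2}}{2} + \alpha_3\,\tfrac{\epsilon_i^{3}}{3} - \cdots\Big),
```

where the coefficients $\alpha_k$ perturb the leading-order terms of the series.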
arXiv Detail & Related papers (2023-05-08T19:31:09Z)
- Grouped Knowledge Distillation for Deep Face Recognition [53.57402723008569]
The lightweight student network has difficulty fitting the target logits due to its low model capacity.
We propose a Grouped Knowledge Distillation (GKD) that retains the Primary-KD and Binary-KD but omits Secondary-KD in the ultimate KD loss calculation.
arXiv Detail & Related papers (2023-04-10T09:04:38Z)
- Respecting Transfer Gap in Knowledge Distillation [74.38776465736471]
Knowledge distillation (KD) is essentially a process of transferring a teacher model's behavior to a student model.
Traditional KD methods hold an underlying assumption that the data collected in both human domain and machine domain are both independent and identically distributed.
We propose Inverse Probability Weighting Distillation (IPWD) that estimates the propensity score of a training sample belonging to the machine domain.
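A hedged sketch of inverse probability weighting as described above: a propensity estimate of each sample belonging to the machine domain reweights its per-sample distillation loss. The clipping and normalization here are illustrative assumptions, not the paper's estimator.

```python
import torch
import torch.nn.functional as F

def ipw_distill_loss(student_logits, teacher_logits, propensity, T=4.0, eps=0.05):
    """propensity: estimated probability that each training sample belongs to
    the machine (teacher) domain, e.g. from a small domain classifier.
    The KD loss of each sample is weighted by the inverse of that propensity."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    per_sample_kd = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1) * T * T
    weights = 1.0 / propensity.clamp(min=eps)    # clip to avoid exploding weights
    weights = weights / weights.mean()           # keep the overall loss scale stable
    return (weights * per_sample_kd).mean()
```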
arXiv Detail & Related papers (2022-10-23T17:05:32Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- ALP-KD: Attention-Based Layer Projection for Knowledge Distillation [30.896957367331137]
Two neural networks, namely a teacher and a student, are coupled together during training.
The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions.
In such a setting, distillation only happens for final predictions, whereas the student could also benefit from the teacher's supervision of internal components.
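A minimal sketch of the attention-based layer projection idea, under the assumption that each student layer attends over all teacher layers and is matched to the attention-weighted combination; the pooling, shared dimensionality, and similarity function are simplifications, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alp_kd_like_loss(student_layers, teacher_layers):
    """student_layers: list of [B, D] pooled student hidden states.
    teacher_layers:  list of [B, D] pooled teacher hidden states (same D here;
    a learned projection would normally align dimensions)."""
    t_stack = torch.stack(teacher_layers, dim=1)             # [B, L_t, D]
    loss = 0.0
    for s in student_layers:                                 # s: [B, D]
        scores = torch.einsum("bd,bld->bl", s, t_stack)      # similarity to every teacher layer
        attn = F.softmax(scores / s.size(-1) ** 0.5, dim=1)  # attention over teacher layers
        fused = torch.einsum("bl,bld->bd", attn, t_stack)    # attention-weighted teacher target
        loss = loss + F.mse_loss(s, fused)
    return loss / len(student_layers)
```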
arXiv Detail & Related papers (2020-12-27T22:30:13Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes a lower bound on the mutual information between the teacher and student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard approach of enforcing the student's feature mean and variance to match those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
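For context, a minimal sketch of the "standard way" the summary refers to, i.e. matching per-channel (instance) mean and variance of the student's feature maps to the teacher's; the paper's own method builds on adaptive instance normalization and goes beyond this baseline, so the details below are assumptions.

```python
import torch
import torch.nn.functional as F

def feature_stats_loss(student_feat, teacher_feat, eps=1e-5):
    """student_feat, teacher_feat: [B, C, H, W] feature maps.
    Matches per-channel mean and standard deviation, the statistics used by
    adaptive instance normalization."""
    s_mean, t_mean = student_feat.mean(dim=(2, 3)), teacher_feat.mean(dim=(2, 3))
    s_std = (student_feat.var(dim=(2, 3)) + eps).sqrt()
    t_std = (teacher_feat.var(dim=(2, 3)) + eps).sqrt()
    return F.mse_loss(s_mean, t_mean) + F.mse_loss(s_std, t_std)
```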
arXiv Detail & Related papers (2020-03-09T17:50:12Z)