Sinkhorn Distance Minimization for Knowledge Distillation
- URL: http://arxiv.org/abs/2402.17110v1
- Date: Tue, 27 Feb 2024 01:13:58 GMT
- Title: Sinkhorn Distance Minimization for Knowledge Distillation
- Authors: Xiao Cui, Yulei Qin, Yuting Gao, Enwei Zhang, Zihan Xu, Tong Wu, Ke
Li, Xing Sun, Wengang Zhou and Houqiang Li
- Abstract summary: Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
- Score: 97.64216712016571
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Knowledge distillation (KD) has been widely adopted to compress large
language models (LLMs). Existing KD methods investigate various divergence
measures including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL),
and Jensen-Shannon (JS) divergences. However, due to limitations inherent in
their assumptions and definitions, these measures fail to deliver effective
supervision when little distributional overlap exists between the teacher and the
student. In this paper, we show that the aforementioned KL, RKL, and JS
divergences respectively suffer from issues of mode-averaging, mode-collapsing,
and mode-underestimation, which deteriorate logits-based KD for diverse NLP
tasks. We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the
Sinkhorn distance to ensure a nuanced and precise assessment of the disparity
between teacher and student distributions. Moreover, by exploiting properties of
the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the
perception of divergence to each individual teacher-student sample pair. Instead, we propose a
batch-wise reformulation to capture geometric intricacies of distributions
across samples in the high-dimensional space. Comprehensive evaluation on GLUE
and SuperGLUE, in terms of comparability, validity, and generalizability,
highlights our superiority over state-of-the-art methods on all kinds of LLMs
with encoder-only, encoder-decoder, and decoder-only architectures.
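For intuition, here is a minimal PyTorch sketch of the entropy-regularized Sinkhorn iterations the abstract refers to. The ground cost matrix, hyperparameters, and per-sample usage below are illustrative assumptions, not the exact batch-wise SinKD recipe from the paper.

```python
import torch

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    """Entropy-regularized optimal transport (Sinkhorn) distance between two
    discrete distributions p (teacher) and q (student) under a ground cost.
    Shapes: p (n,), q (m,), cost (n, m)."""
    K = torch.exp(-cost / eps)          # Gibbs kernel
    u = torch.ones_like(p)
    v = torch.ones_like(q)
    for _ in range(n_iters):            # Sinkhorn-Knopp fixed-point updates
        u = p / (K @ v + 1e-9)
        v = q / (K.t() @ u + 1e-9)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)   # approximate transport plan
    return (plan * cost).sum()

# Illustrative per-sample use with an assumed 0/1 ground cost over classes;
# SinKD's batch-wise reformulation extends this across samples in a batch.
teacher_logits = torch.randn(10)
student_logits = torch.randn(10)
p = torch.softmax(teacher_logits, dim=-1)
q = torch.softmax(student_logits, dim=-1)
C = 1.0 - torch.eye(10)
loss = sinkhorn_distance(p, q, C)
```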
Related papers
- Kendall's $τ$ Coefficient for Logits Distillation [33.77389987117822]
We propose a ranking loss based on Kendall's $τ$ coefficient, called Rank-Kendall Knowledge Distillation (RKKD).
RKKD balances the attention to smaller-valued channels by constraining the order of channel values in student logits.
Our experiments show that our RKKD can enhance the performance of various knowledge distillation baselines and offer broad improvements across multiple teacher-student architecture combinations.
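The exact RKKD loss is not spelled out in this summary; the following is a generic, hedged sketch of a differentiable Kendall's-tau-style ranking objective between student and teacher logits, in the spirit of what the abstract describes.

```python
import torch

def soft_kendall_tau_loss(student_logits, teacher_logits, alpha=1.0):
    """Differentiable surrogate for a Kendall's tau ranking objective
    between student and teacher logits of shape (batch, num_classes).
    Illustrative only; not the exact RKKD formulation."""
    # Pairwise differences over the class dimension: (B, C, C)
    s_diff = student_logits.unsqueeze(2) - student_logits.unsqueeze(1)
    t_diff = teacher_logits.unsqueeze(2) - teacher_logits.unsqueeze(1)
    # Soft sign of student differences times hard sign of teacher differences;
    # the mean over ordered pairs (diagonal contributes zero) yields a soft
    # rank-correlation score in [-1, 1].
    concordance = torch.tanh(alpha * s_diff) * torch.sign(t_diff)
    tau = concordance.mean(dim=(1, 2))
    return (1.0 - tau).mean()   # minimize to maximize rank agreement
```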
arXiv Detail & Related papers (2024-09-26T13:21:02Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Kolmogorov-Smirnov GAN [52.36633001046723]
We propose a novel deep generative model, the Kolmogorov-Smirnov Generative Adversarial Network (KSGAN).
Unlike existing approaches, KSGAN formulates the learning process as a minimization of the Kolmogorov-Smirnov (KS) distance.
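As a reminder of the underlying statistic (not the KSGAN training objective itself), a one-dimensional empirical KS distance can be sketched as follows; the sample-based formulation here is an assumption for illustration.

```python
import torch

def ks_distance_1d(samples_p, samples_q):
    """Empirical Kolmogorov-Smirnov distance between two 1-D sample sets:
    the supremum of the absolute difference of their empirical CDFs."""
    # Evaluate both empirical CDFs at every observed sample point.
    xs = torch.cat([samples_p, samples_q]).sort().values
    cdf_p = (samples_p.unsqueeze(0) <= xs.unsqueeze(1)).float().mean(dim=1)
    cdf_q = (samples_q.unsqueeze(0) <= xs.unsqueeze(1)).float().mean(dim=1)
    return (cdf_p - cdf_q).abs().max()
```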
arXiv Detail & Related papers (2024-06-28T14:30:14Z)
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We reformulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
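The implicit-reward term is not specified in this summary, but the reverse KL component mentioned above has a standard form; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def reverse_kl(student_logits, teacher_logits):
    """Reverse KL divergence KL(student || teacher) over the class/vocab
    dimension. Only one of the two terms the DPKD summary mentions; the
    implicit-reward term is omitted because its exact form is not given."""
    log_s = F.log_softmax(student_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    s = log_s.exp()
    return (s * (log_s - log_t)).sum(dim=-1).mean()
```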
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Dual-Space Knowledge Distillation for Large Language Models [39.798007795604676]
We propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD.
Our framework is not only compatible with various distance functions for KD like the current framework, but also supports KD between any two LLMs regardless of their vocabularies.
arXiv Detail & Related papers (2024-06-25T07:25:15Z)
- Decoupled Kullback-Leibler Divergence Loss [90.54331083430597]
We prove that the Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
We introduce class-wise global information into KL/DKL to mitigate the bias arising from individual samples.
The proposed approach achieves new state-of-the-art adversarial robustness on the public leaderboard.
arXiv Detail & Related papers (2023-05-23T11:17:45Z)
- Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation [9.157410884444312]
Knowledge distillation (KD) has been investigated to design efficient neural architectures.
We show that the KL divergence loss focuses on logit matching as tau increases and on label matching as tau goes to 0.
We show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with small tau, mitigates label noise.
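A minimal sketch of the two losses being compared, showing where the temperature tau enters; the tau**2 gradient scaling is the common convention and an assumption here, not a claim about the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau=4.0):
    """Temperature-scaled KL distillation loss. Large tau softens both
    distributions so the loss behaves like logit matching; tau -> 0
    sharpens them toward label (argmax) matching."""
    log_s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    # tau**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_s, t, reduction="batchmean") * tau ** 2

def kd_mse_loss(student_logits, teacher_logits):
    """Direct mean-squared error on raw logits, the comparison point
    named in the paper's title."""
    return F.mse_loss(student_logits, teacher_logits)
```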
arXiv Detail & Related papers (2021-05-19T04:40:53Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)
- Imitation Learning with Sinkhorn Distances [12.161649672131286]
We present tractable solutions by formulating imitation learning as minimization of the Sinkhorn distance between occupancy measures.
We evaluate the proposed approach using both the reward metric and the Sinkhorn distance metric on a number of MuJoCo experiments.
arXiv Detail & Related papers (2020-08-20T19:13:21Z)