Sinkhorn Distance Minimization for Knowledge Distillation
- URL: http://arxiv.org/abs/2402.17110v1
- Date: Tue, 27 Feb 2024 01:13:58 GMT
- Title: Sinkhorn Distance Minimization for Knowledge Distillation
- Authors: Xiao Cui, Yulei Qin, Yuting Gao, Enwei Zhang, Zihan Xu, Tong Wu, Ke
Li, Xing Sun, Wengang Zhou and Houqiang Li
- Abstract summary: Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
- Score: 97.64216712016571
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Knowledge distillation (KD) has been widely adopted to compress large
language models (LLMs). Existing KD methods investigate various divergence
measures including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL),
and Jensen-Shannon (JS) divergences. However, due to limitations inherent in
their assumptions and definitions, these measures fail to deliver effective
supervision when little distributional overlap exists between the teacher and the
student. In this paper, we show that the aforementioned KL, RKL, and JS
divergences respectively suffer from issues of mode-averaging, mode-collapsing,
and mode-underestimation, which deteriorate logits-based KD for diverse NLP
tasks. We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the
Sinkhorn distance to ensure a nuanced and precise assessment of the disparity
between teacher and student distributions. Moreover, by exploiting properties of
the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the
perception of divergence to each individual teacher-student sample pair. Instead, we propose a
batch-wise reformulation to capture geometric intricacies of distributions
across samples in the high-dimensional space. Comprehensive evaluation on GLUE
and SuperGLUE, in terms of comparability, validity, and generalizability,
highlights our superiority over state-of-the-art methods on all kinds of LLMs
with encoder-only, encoder-decoder, and decoder-only architectures.
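For intuition, here is a minimal PyTorch sketch of the entropy-regularized Sinkhorn iterations the abstract refers to. The ground cost matrix, hyperparameters, and per-sample usage below are illustrative assumptions, not the exact batch-wise SinKD recipe from the paper.

```python
import torch

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    """Entropy-regularized optimal transport (Sinkhorn) distance between two
    discrete distributions p (teacher) and q (student) under a ground cost.
    Shapes: p (n,), q (m,), cost (n, m)."""
    K = torch.exp(-cost / eps)          # Gibbs kernel
    u = torch.ones_like(p)
    v = torch.ones_like(q)
    for _ in range(n_iters):            # Sinkhorn-Knopp fixed-point updates
        u = p / (K @ v + 1e-9)
        v = q / (K.t() @ u + 1e-9)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)   # approximate transport plan
    return (plan * cost).sum()

# Illustrative per-sample use with an assumed 0/1 ground cost over classes;
# SinKD's batch-wise reformulation extends this across samples in a batch.
teacher_logits = torch.randn(10)
student_logits = torch.randn(10)
p = torch.softmax(teacher_logits, dim=-1)
q = torch.softmax(student_logits, dim=-1)
C = 1.0 - torch.eye(10)
loss = sinkhorn_distance(p, q, C)
```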
Related papers
- Kendall's $τ$ Coefficient for Logits Distillation [33.77389987117822]
We propose a ranking loss based on Kendall's $τ$ coefficient, called Rank-Kendall Knowledge Distillation (RKKD).
RKKD balances the attention to smaller-valued channels by constraining the order of channel values in student logits.
Our experiments show that our RKKD can enhance the performance of various knowledge distillation baselines and offer broad improvements across multiple teacher-student architecture combinations.
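The exact RKKD loss is not spelled out in this summary; the following is a generic, hedged sketch of a differentiable Kendall's-tau-style ranking objective between student and teacher logits, in the spirit of what the abstract describes.

```python
import torch

def soft_kendall_tau_loss(student_logits, teacher_logits, alpha=1.0):
    """Differentiable surrogate for a Kendall's tau ranking objective
    between student and teacher logits of shape (batch, num_classes).
    Illustrative only; not the exact RKKD formulation."""
    # Pairwise differences over the class dimension: (B, C, C)
    s_diff = student_logits.unsqueeze(2) - student_logits.unsqueeze(1)
    t_diff = teacher_logits.unsqueeze(2) - teacher_logits.unsqueeze(1)
    # Soft sign of student differences times hard sign of teacher differences;
    # the mean over ordered pairs (diagonal contributes zero) yields a soft
    # rank-correlation score in [-1, 1].
    concordance = torch.tanh(alpha * s_diff) * torch.sign(t_diff)
    tau = concordance.mean(dim=(1, 2))
    return (1.0 - tau).mean()   # minimize to maximize rank agreement
```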
arXiv Detail & Related papers (2024-09-26T13:21:02Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Kolmogorov-Smirnov GAN [52.36633001046723]
We propose a novel deep generative model, the Kolmogorov-Smirnov Generative Adversarial Network (KSGAN).
Unlike existing approaches, KSGAN formulates the learning process as a minimization of the Kolmogorov-Smirnov (KS) distance.
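As a reminder of the underlying statistic (not the KSGAN training objective itself), a one-dimensional empirical KS distance can be sketched as follows; the sample-based formulation here is an assumption for illustration.

```python
import torch

def ks_distance_1d(samples_p, samples_q):
    """Empirical Kolmogorov-Smirnov distance between two 1-D sample sets:
    the supremum of the absolute difference of their empirical CDFs."""
    # Evaluate both empirical CDFs at every observed sample point.
    xs = torch.cat([samples_p, samples_q]).sort().values
    cdf_p = (samples_p.unsqueeze(0) <= xs.unsqueeze(1)).float().mean(dim=1)
    cdf_q = (samples_q.unsqueeze(0) <= xs.unsqueeze(1)).float().mean(dim=1)
    return (cdf_p - cdf_q).abs().max()
```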
arXiv Detail & Related papers (2024-06-28T14:30:14Z)
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We reformulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
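The implicit-reward term is not specified in this summary, but the reverse KL component mentioned above has a standard form; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def reverse_kl(student_logits, teacher_logits):
    """Reverse KL divergence KL(student || teacher) over the class/vocab
    dimension. Only one of the two terms the DPKD summary mentions; the
    implicit-reward term is omitted because its exact form is not given."""
    log_s = F.log_softmax(student_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    s = log_s.exp()
    return (s * (log_s - log_t)).sum(dim=-1).mean()
```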
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Dual-Space Knowledge Distillation for Large Language Models [39.798007795604676]
We propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD.
Our framework is not only compatible with various distance functions for KD like the current framework, but also supports KD between any two LLMs regardless of their vocabularies.
arXiv Detail & Related papers (2024-06-25T07:25:15Z)
- Decoupled Kullback-Leibler Divergence Loss [90.54331083430597]
We prove that the Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
We introduce class-wise global information into KL/DKL to mitigate the bias arising from individual samples.
The proposed approach achieves new state-of-the-art adversarial robustness on the public leaderboard.
arXiv Detail & Related papers (2023-05-23T11:17:45Z)
- Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation [9.157410884444312]
Knowledge distillation (KD) has been investigated to design efficient neural architectures.
We show that the KL divergence loss focuses on logit matching as tau increases and on label matching as tau goes to 0.
We show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with small tau, mitigates label noise.
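A minimal sketch of the two losses being compared, showing where the temperature tau enters; the tau**2 gradient scaling is the common convention and an assumption here, not a claim about the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau=4.0):
    """Temperature-scaled KL distillation loss. Large tau softens both
    distributions so the loss behaves like logit matching; tau -> 0
    sharpens them toward label (argmax) matching."""
    log_s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    # tau**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_s, t, reduction="batchmean") * tau ** 2

def kd_mse_loss(student_logits, teacher_logits):
    """Direct mean-squared error on raw logits, the comparison point
    named in the paper's title."""
    return F.mse_loss(student_logits, teacher_logits)
```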
arXiv Detail & Related papers (2021-05-19T04:40:53Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)
- Imitation Learning with Sinkhorn Distances [12.161649672131286]
We present tractable solutions by formulating imitation learning as minimization of the Sinkhorn distance between occupancy measures.
We evaluate the proposed approach using both the reward metric and the Sinkhorn distance metric on a number of MuJoCo experiments.
arXiv Detail & Related papers (2020-08-20T19:13:21Z)