Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
- URL: http://arxiv.org/abs/2404.02657v2
- Date: Sun, 16 Jun 2024 14:32:48 GMT
- Title: Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
- Authors: Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, Ngai Wong
- Abstract summary: Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs).
Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable to the mean-seeking forward Kullback-Leibler (FKL) divergence, the study shows that neither property manifests in KD for LLMs.
We propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL.
- Score: 19.99524316407591
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable to the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective, and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that RKL focuses on the tail part of the distributions, while FKL focuses on the head part in the beginning epochs. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
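For intuition, here is a minimal PyTorch-style sketch of token-level FKL, RKL, and an adaptive combination. The head/tail split and the weighting rule below are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def fkl(teacher_logits, student_logits):
    # Forward KL: KL(teacher || student), averaged over token positions.
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(-1).mean()

def rkl(teacher_logits, student_logits):
    # Reverse KL: KL(student || teacher), averaged over token positions.
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(-1).mean()

def akl(teacher_logits, student_logits):
    # Assumed adaptive rule: measure the teacher-student gap on the head
    # (high-probability tokens) vs. the tail of the teacher distribution,
    # then weight FKL by the head share of the gap and RKL by the rest.
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    head = p > p.mean(dim=-1, keepdim=True)   # crude head/tail split
    gap = (p - q).abs()
    head_gap = (gap * head).sum(-1)
    tail_gap = (gap * ~head).sum(-1)
    w = (head_gap / (head_gap + tail_gap + 1e-8)).mean()
    return w * fkl(teacher_logits, student_logits) \
        + (1 - w) * rkl(teacher_logits, student_logits)
```

Under this rule, the combined loss leans on FKL when the student's error concentrates on the teacher's head and on RKL when it concentrates on the tail, mirroring the head/tail behavior described in the abstract.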
Related papers
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages, the first of which optimizes an objective consisting of an implicit reward and a reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning [41.59855801010565]
Large multimodal models (LMMs) are highly robust against natural distribution shifts.
Despite this, domain-specific adaptation is still necessary, particularly in specialized areas like healthcare.
This work investigates in-context learning (ICL) as an effective alternative for enhancing LMMs' adaptability.
arXiv Detail & Related papers (2024-05-20T17:59:21Z)
- Sinkhorn Distance Minimization for Knowledge Distillation [97.64216712016571]
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
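As a rough illustration of the underlying computation (a minimal sketch, not the SinKD implementation; the cost matrix, eps, and iteration count are assumptions), the entropy-regularized transport distance between teacher and student distributions can be obtained with Sinkhorn-Knopp iterations:

```python
import torch

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    # Entropy-regularized OT between distributions p and q (each shape [V])
    # under a cost matrix of shape [V, V]; returns <transport plan, cost>.
    K = torch.exp(-cost / eps)               # Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(n_iters):                 # Sinkhorn-Knopp fixed point
        v = q / (K.t() @ u)
        u = p / (K @ v)
    T = u.unsqueeze(1) * K * v.unsqueeze(0)  # approximate transport plan
    return (T * cost).sum()
```

In a distillation setting, p and q would be softmaxed teacher and student logits, and the cost matrix might encode token dissimilarity (for example, distances between token embeddings).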
arXiv Detail & Related papers (2024-02-27T01:13:58Z)
- Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning [105.77733287326308]
We evaluate 10 recent open-source LMMs, from 3B up to 80B parameters, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following.
We explore training-free in-context learning (ICL) as a solution and study how it affects these limitations.
Based on our ICL study, we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
arXiv Detail & Related papers (2023-10-01T12:02:59Z)
- Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals [67.64770842323966]
Causal explanations of predictions of NLP systems are essential to ensure safety and establish trust.
Existing methods often fall short of explaining model predictions effectively or efficiently.
We propose two approaches for counterfactual (CF) approximation.
arXiv Detail & Related papers (2023-10-01T07:31:04Z)
- Decoupled Kullback-Leibler Divergence Loss [75.31157286595517]
Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
We introduce global information into DKL for intra-class consistency regularization.
The proposed approach achieves new state-of-the-art performance on both tasks, demonstrating its substantial practical merits.
arXiv Detail & Related papers (2023-05-23T11:17:45Z)
- RL with KL penalties is better viewed as Bayesian inference [4.473139775790299]
We analyze challenges associated with treating a language model as a Reinforcement Learning (RL) policy.
We show how avoiding those challenges requires moving beyond the RL paradigm.
arXiv Detail & Related papers (2022-05-23T12:47:13Z)
- Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence [77.06203118175335]
Variational Inference (VI) is a popular alternative to exact sampling in Bayesian inference.
Importance sampling (IS) is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures.
We propose a novel combination of optimization and sampling techniques for approximate Bayesian inference.
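For background on the importance-sampling side (a generic sketch under assumed log-density callables, not the paper's refinement procedure), self-normalized IS reweights draws from the approximation q to estimate expectations under the target p:

```python
import numpy as np

def snis_expectation(f, log_p, log_q, samples):
    # Self-normalized importance sampling: estimate E_p[f(x)] from
    # samples x_i ~ q with weights w_i proportional to p(x_i) / q(x_i).
    log_w = log_p(samples) - log_q(samples)
    log_w -= log_w.max()        # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return np.sum(w * f(samples))
```

The closer q is to p (for example, after refining q with a forward-KL objective), the lower the variance of the weights and the more reliable the estimate.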
arXiv Detail & Related papers (2021-06-30T11:00:24Z)