Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
- URL: http://arxiv.org/abs/2404.02657v2
- Date: Sun, 16 Jun 2024 14:32:48 GMT
- Title: Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
- Authors: Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, Ngai Wong
- Abstract summary: Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs).
Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable to the mean-seeking forward Kullback-Leibler (FKL) divergence, the study shows that neither property manifests in KD for LLMs.
We propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL.
- Score: 19.99524316407591
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable to the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective, and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that RKL focuses on the tail part of the distributions, while FKL focuses on the head part in the beginning epochs. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
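For intuition, here is a minimal PyTorch-style sketch of token-level FKL, RKL, and an adaptive combination. The head/tail split and the weighting rule below are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def fkl(teacher_logits, student_logits):
    # Forward KL: KL(teacher || student), averaged over token positions.
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(-1).mean()

def rkl(teacher_logits, student_logits):
    # Reverse KL: KL(student || teacher), averaged over token positions.
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(-1).mean()

def akl(teacher_logits, student_logits):
    # Assumed adaptive rule: measure the teacher-student gap on the head
    # (high-probability tokens) vs. the tail of the teacher distribution,
    # then weight FKL by the head share of the gap and RKL by the rest.
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    head = p > p.mean(dim=-1, keepdim=True)   # crude head/tail split
    gap = (p - q).abs()
    head_gap = (gap * head).sum(-1)
    tail_gap = (gap * ~head).sum(-1)
    w = (head_gap / (head_gap + tail_gap + 1e-8)).mean()
    return w * fkl(teacher_logits, student_logits) \
        + (1 - w) * rkl(teacher_logits, student_logits)
```

Under this rule, the combined loss leans on FKL when the student's error concentrates on the teacher's head and on RKL when it concentrates on the tail, mirroring the head/tail behavior described in the abstract.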
Related papers
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages, the first of which optimizes an objective consisting of an implicit reward and a reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning [41.59855801010565]
Large multimodal models (LMMs) are highly robust against natural distribution shifts.
Despite this, domain-specific adaptation is still necessary, particularly in specialized areas like healthcare.
This work investigates in-context learning (ICL) as an effective alternative for enhancing LMMs' adaptability.
arXiv Detail & Related papers (2024-05-20T17:59:21Z)
- Sinkhorn Distance Minimization for Knowledge Distillation [97.64216712016571]
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
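As a rough illustration of the underlying computation (a minimal sketch, not the SinKD implementation; the cost matrix, eps, and iteration count are assumptions), the entropy-regularized transport distance between teacher and student distributions can be obtained with Sinkhorn-Knopp iterations:

```python
import torch

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    # Entropy-regularized OT between distributions p and q (each shape [V])
    # under a cost matrix of shape [V, V]; returns <transport plan, cost>.
    K = torch.exp(-cost / eps)               # Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(n_iters):                 # Sinkhorn-Knopp fixed point
        v = q / (K.t() @ u)
        u = p / (K @ v)
    T = u.unsqueeze(1) * K * v.unsqueeze(0)  # approximate transport plan
    return (T * cost).sum()
```

In a distillation setting, p and q would be softmaxed teacher and student logits, and the cost matrix might encode token dissimilarity (for example, distances between token embeddings).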
arXiv Detail & Related papers (2024-02-27T01:13:58Z)
- Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning [105.77733287326308]
We evaluate 10 recent open-source LMMs, from 3B up to 80B parameters, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following.
We explore training-free in-context learning (ICL) as a solution and study how it affects these limitations.
Based on our ICL study, we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
arXiv Detail & Related papers (2023-10-01T12:02:59Z)
- Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals [67.64770842323966]
Causal explanations of predictions of NLP systems are essential to ensure safety and establish trust.
Existing methods often fall short of explaining model predictions effectively or efficiently.
We propose two approaches for counterfactual (CF) approximation.
arXiv Detail & Related papers (2023-10-01T07:31:04Z)
- Decoupled Kullback-Leibler Divergence Loss [75.31157286595517]
Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
We introduce global information into DKL for intra-class consistency regularization.
The proposed approach achieves new state-of-the-art performance on both tasks, demonstrating its substantial practical merits.
arXiv Detail & Related papers (2023-05-23T11:17:45Z)
- RL with KL penalties is better viewed as Bayesian inference [4.473139775790299]
We analyze challenges associated with treating a language model as a Reinforcement Learning (RL) policy.
We show how avoiding those challenges requires moving beyond the RL paradigm.
arXiv Detail & Related papers (2022-05-23T12:47:13Z)
- Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence [77.06203118175335]
Variational Inference (VI) is a popular alternative to exact sampling in Bayesian inference.
Importance sampling (IS) is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures.
We propose a novel combination of optimization and sampling techniques for approximate Bayesian inference.
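For background on the importance-sampling side (a generic sketch under assumed log-density callables, not the paper's refinement procedure), self-normalized IS reweights draws from the approximation q to estimate expectations under the target p:

```python
import numpy as np

def snis_expectation(f, log_p, log_q, samples):
    # Self-normalized importance sampling: estimate E_p[f(x)] from
    # samples x_i ~ q with weights w_i proportional to p(x_i) / q(x_i).
    log_w = log_p(samples) - log_q(samples)
    log_w -= log_w.max()        # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return np.sum(w * f(samples))
```

The closer q is to p (for example, after refining q with a forward-KL objective), the lower the variance of the weights and the more reliable the estimate.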
arXiv Detail & Related papers (2021-06-30T11:00:24Z)