Direct Preference Knowledge Distillation for Large Language Models
- URL: http://arxiv.org/abs/2406.19774v1
- Date: Fri, 28 Jun 2024 09:23:40 GMT
- Title: Direct Preference Knowledge Distillation for Large Language Models
- Authors: Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei
- Abstract summary: We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of an implicit reward and reverse KL divergence, and then improving the preference probability of teacher outputs over student outputs.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations and challenges in the distillation of LLMs, including inefficiency and the insufficient measurement capability of the traditional KL divergence. We show that LLMs can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD utilizes distribution divergence to represent the preference loss and implicit reward function. We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of an implicit reward and reverse KL divergence, and then improving the preference probability of teacher outputs over student outputs. We conduct experiments and analysis on various datasets with LLM parameters ranging from 120M to 13B and demonstrate the broad applicability and effectiveness of our DPKD approach. Meanwhile, we prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. The DPKD method outperforms the baseline method in both output response precision and exact-match percentage. Code and data are available at https://aka.ms/dpkd.
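Read literally, the two stages suggest a small amount of code. The following PyTorch sketch shows the ingredients as we understand them from the abstract: a DPO-style implicit reward built from student and reference log-probabilities, a Bradley-Terry preference loss that prefers teacher outputs over student outputs, and a reverse KL term. The function names, the log-ratio reward form, and beta are illustrative assumptions, not the paper's released implementation.

```python
import torch.nn.functional as F

def implicit_reward(student_logprobs, ref_logprobs, beta=0.1):
    # Assumed DPO-style implicit reward: beta * log(pi_student(y|x) / pi_ref(y|x)),
    # from per-token log-probabilities of a sampled response y (last dim = tokens).
    return beta * (student_logprobs - ref_logprobs).sum(dim=-1)

def preference_loss(reward_teacher_out, reward_student_out):
    # Stage 2 idea: raise the preference probability of teacher outputs over
    # student outputs via a Bradley-Terry (logistic) pairwise loss.
    return -F.logsigmoid(reward_teacher_out - reward_student_out).mean()

def reverse_kl(student_logits, teacher_logits):
    # Stage 1 ingredient: reverse KL, KL(student || teacher), which is
    # mode-seeking and penalizes mass placed where the teacher has little.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()
```

In this reading, stage 1 optimizes a combination of the implicit reward and the reverse KL, and stage 2 applies the pairwise loss to teacher-versus-student output pairs.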
Related papers
- Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment
We propose Ranking Loss based Knowledge Distillation (RLKD), which encourages consistency of peak predictions between the teacher and student models.
Our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.
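As a rough illustration of the ranking-loss idea, a pairwise hinge over the teacher's top-k peaks could look like the sketch below; k, the margin, and the hinge form are our assumptions, not RLKD's actual objective.

```python
import torch
import torch.nn.functional as F

def topk_ranking_loss(student_logits, teacher_logits, k=5, margin=0.1):
    # Indices of the teacher's k highest-scoring tokens, sorted descending.
    peaks = teacher_logits.topk(k, dim=-1).indices            # (batch, k)
    s = student_logits.gather(-1, peaks)                      # student scores at the peaks
    # Pairwise hinge: the student score at rank i should exceed rank j (i < j).
    diffs = s.unsqueeze(-1) - s.unsqueeze(-2)                 # (batch, k, k)
    upper = torch.triu(torch.ones(k, k, dtype=torch.bool, device=s.device), diagonal=1)
    return F.relu(margin - diffs[..., upper]).mean()
```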
arXiv Detail & Related papers (2024-09-19T08:06:42Z)
- PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
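A pseudo-preference pair takes very little code to use. The hinge below is a minimal stand-in, assuming each pair is (teacher output = preferred, student output = dispreferred) and scoring both with the student's sequence log-likelihood; PLaD's actual calibration objective differs in its details.

```python
import torch.nn.functional as F

def pseudo_preference_loss(logp_teacher_out, logp_student_out, margin=1.0):
    # Both arguments are the *student's* log-likelihoods of the two outputs
    # in a pseudo-preference pair, shape (batch,). The hinge nudges the
    # student to rank the teacher's output above its own generation.
    return F.relu(margin - (logp_teacher_out - logp_student_out)).mean()
```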
arXiv Detail & Related papers (2024-06-05T03:08:25Z)
- Sinkhorn Distance Minimization for Knowledge Distillation
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
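For orientation, a textbook entropy-regularized Sinkhorn iteration between two vocabulary distributions looks like the sketch below; the cost matrix, epsilon, and the iteration count are assumptions, and SinKD's batched, differentiable formulation will differ.

```python
import torch

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    # p, q: categorical distributions over the vocabulary, shape (V,).
    # cost: (V, V) ground-cost matrix between tokens.
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(n_iters):                     # Sinkhorn fixed-point updates
        v = q / (K.t() @ u + 1e-9)
        u = p / (K @ v + 1e-9)
    transport = u.unsqueeze(1) * K * v.unsqueeze(0)  # approximate OT plan
    return (transport * cost).sum()                  # regularized transport cost
```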
arXiv Detail & Related papers (2024-02-27T01:13:58Z)
- A Survey on Knowledge Distillation of Large Language Models
Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities to open-source models.
This paper presents a comprehensive survey of KD's role within the realm of Large Language Models (LLMs).
arXiv Detail & Related papers (2024-02-20T16:17:37Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to enhance the efficiency of utilizing student-generated outputs.
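The skew divergence itself is compact. Below is a sketch of KL(p || alpha*p + (1-alpha)*q) with teacher p and student q; alpha = 0.1 and the clamping are illustrative choices, and DistiLLM's exact losses (including the skew reverse-KL variant) are in the paper.

```python
import torch.nn.functional as F

def skew_kl(teacher_logits, student_logits, alpha=0.1):
    # KL(p || alpha * p + (1 - alpha) * q): mixing a little teacher mass into
    # the target keeps the support wide and the gradients bounded.
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    mix = alpha * p + (1 - alpha) * q
    return (p * (p.clamp_min(1e-9).log() - mix.clamp_min(1e-9).log())).sum(-1).mean()
```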
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- MiniLLM: Knowledge Distillation of Large Language Models
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
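MiniLLM's objective is a reverse KL minimized on sequences sampled from the student; a bare REINFORCE-style surrogate for that objective might look like the sketch below (the actual method adds several variance-reduction and stabilization techniques).

```python
import torch

def reverse_kl_surrogate(student_logprob_y, teacher_logprob_y):
    # KL(q_student || p_teacher) = E_{y ~ q}[log q(y) - log p(y)], estimated
    # on student-sampled sequences y; inputs are per-sequence log-likelihoods.
    with torch.no_grad():
        reward = -(student_logprob_y - teacher_logprob_y)  # higher = closer to teacher
    # Policy-gradient surrogate: its gradient matches that of the reverse KL.
    return -(reward * student_logprob_y).mean()
```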
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
- Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective
Intermediate Layer Distillation (ILD) has been a de facto standard KD method in the NLP field owing to its strong performance.
In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD.
We propose a simple yet effective consistency-regularized ILD, which prevents the student model from overfitting the training dataset.
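One way to picture a consistency-regularized ILD loss: pull two stochastic student views (e.g., different dropout masks) toward the teacher's hidden states while keeping the views close to each other. The sketch below is our guess at the shape of such an objective, not the paper's exact formulation.

```python
import torch.nn.functional as F

def consistency_regularized_ild(h_student_a, h_student_b, h_teacher, lam=1.0):
    # ILD terms: match each student view to the (projected) teacher states.
    ild = F.mse_loss(h_student_a, h_teacher) + F.mse_loss(h_student_b, h_teacher)
    # Consistency term: discourage the student from memorizing one noisy fit.
    consistency = F.mse_loss(h_student_a, h_student_b)
    return ild + lam * consistency
```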
arXiv Detail & Related papers (2023-02-03T04:09:22Z)
- Extending Label Smoothing Regularization with Self-Knowledge Distillation
We propose LsrKD, an algorithm that boosts training by extending the LSR method to the KD regime and applying a softer temperature.
To further improve the performance of LsrKD, we develop a self-distillation method named Memory-replay Knowledge Distillation (MrKD).
Our experiments show that LsrKD consistently improves performance over LSR at no extra cost, especially on several deep neural networks where LSR is ineffectual.
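Label smoothing is equivalent to KD against a uniform teacher, which is what makes the LSR-to-KD extension natural. Below is a hedged sketch of a smoothing-style KD loss with a soft temperature; eps, tau, and the mixing scheme are illustrative, and the memory-replay component of MrKD is not shown.

```python
import torch.nn.functional as F

def lsr_kd_loss(student_logits, teacher_logits, labels, eps=0.1, tau=4.0):
    # Hard-label term, as in ordinary label smoothing.
    ce = F.cross_entropy(student_logits, labels)
    # Soft term: a high-temperature teacher replaces LSR's uniform distribution.
    t_soft = F.softmax(teacher_logits / tau, dim=-1)
    s_logp = F.log_softmax(student_logits / tau, dim=-1)
    kd = F.kl_div(s_logp, t_soft, reduction="batchmean") * tau * tau
    return (1 - eps) * ce + eps * kd
```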
arXiv Detail & Related papers (2020-09-11T04:23:34Z)