PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs
- URL: http://arxiv.org/abs/2406.02886v2
- Date: Thu, 6 Jun 2024 12:47:31 GMT
- Title: PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs
- Authors: Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Haorui Wang, Zhen Qin, Feng Han, Jialu Liu, Simon Baumgartner, Michael Bendersky, Chao Zhang
- Abstract summary: Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
- Score: 47.35598271306371
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings. Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models. However, traditional KD techniques face specific challenges when applied to LLMs, including restricted access to LLM outputs, significant teacher-student capacity gaps, and the inherited mis-calibration issue. In this work, we present PLaD, a novel preference-based LLM distillation framework. PLaD exploits the teacher-student capacity discrepancy to generate pseudo-preference pairs where teacher outputs are preferred over student outputs. Then, PLaD leverages a ranking loss to re-calibrate the student's estimation of sequence likelihood, which steers the student's focus towards understanding the relative quality of outputs instead of simply imitating the teacher. PLaD bypasses the need for access to the teacher LLM's internal states, tackles the student's expressivity limitations, and mitigates the student mis-calibration issue. Through extensive experiments on two sequence generation tasks and with various LLMs, we demonstrate the effectiveness of our proposed PLaD framework.
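To make the core idea concrete, the sketch below shows what a ranking loss over pseudo-preference pairs could look like. It is an illustrative assumption written in PyTorch, not the paper's actual implementation: it assumes each pseudo-preference pair consists of a teacher-generated output (preferred) and a student-generated output (dispreferred) for the same prompt, both scored by the student model's sequence log-likelihood, and it uses a Bradley-Terry style pairwise objective as the calibration loss.

```python
import torch
import torch.nn.functional as F

def pseudo_preference_ranking_loss(
    logp_preferred: torch.Tensor,    # student's log p(y_teacher | x), shape (batch,)
    logp_dispreferred: torch.Tensor, # student's log p(y_student | x), shape (batch,)
    margin: float = 0.0,
) -> torch.Tensor:
    """Pairwise ranking loss over pseudo-preference pairs (illustrative sketch).

    The student is trained to assign higher sequence likelihood to the
    teacher-generated (preferred) output than to its own (dispreferred)
    output, re-calibrating its likelihood estimates rather than imitating
    the teacher token by token.
    """
    # Bradley-Terry style objective: -log sigmoid(logp_w - logp_l - margin)
    return -F.logsigmoid(logp_preferred - logp_dispreferred - margin).mean()

# Example usage with dummy sequence log-likelihoods:
logp_w = torch.tensor([-42.3, -57.1])  # teacher outputs scored by the student
logp_l = torch.tensor([-40.8, -60.5])  # student outputs scored by the student
loss = pseudo_preference_ranking_loss(logp_w, logp_l, margin=1.0)
```

The exact form of the loss (margin, Bradley-Terry vs. hinge) is a design choice here; the key property is that only relative sequence likelihoods enter the objective, so no access to the teacher's internal states is required.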
Related papers
- Pre-training Distillation for Large Language Models: A Design Space Exploration [54.67324039434781]
Pre-training distillation aims to transfer knowledge from a large teacher model to a smaller student model.
We conduct experiments to explore the design space of pre-training distillation and find better configurations.
We hope our exploration of the design space will inform future practices in pre-training distillation.
arXiv Detail & Related papers (2024-10-21T17:16:13Z) - Mentor-KD: Making Small Language Models Better Multi-step Reasoners [15.159415340059388]
We propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs.
We exploit a mentor, an intermediate-sized task-specific fine-tuned model, to augment the training data with additional CoT annotations.
We conduct extensive experiments and confirm Mentor-KD's effectiveness across various models and complex reasoning tasks.
arXiv Detail & Related papers (2024-10-11T17:53:27Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs)
We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of an implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z) - Aligning Teacher with Student Preferences for Tailored Training Data Generation [40.85451525264779]
We propose ARTE, dubbed Aligning TeacheR with StudenT PreferencEs, to generate tailored training examples for Knowledge Distillation.
Specifically, we elicit draft questions and rationales from the teacher model, then collect student preferences on these questions and rationales.
In the end, we repeat the first step with the aligned teacher model to elicit tailored training examples for the student model on the target task.
arXiv Detail & Related papers (2024-06-27T14:51:17Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding the decoding process of LLMs with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Adversarial Moment-Matching Distillation of Large Language Models [3.9160947065896803]
Knowledge distillation (KD) has been shown to be highly effective in guiding a student model with a larger teacher model.
We propose an adversarial training algorithm to jointly estimate the moment-matching distance and optimize the student policy to minimize it.
Results from both task-agnostic instruction-following experiments and task-specific experiments demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-06-05T05:27:29Z) - TPD: Enhancing Student Language Model Reasoning via Principle Discovery and Guidance [0.0]
We introduce a principle-based teacher-student framework called "Teaching via Principle Discovery" (TPD).
Inspired by human learning mechanisms, TPD mimics the interaction between a teacher and a student using a principle-based approach.
TPD significantly improves the student model's performance, achieving a 6.2% improvement on average.
arXiv Detail & Related papers (2024-01-24T23:11:33Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)