Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models
- URL: http://arxiv.org/abs/2502.14272v1
- Date: Thu, 20 Feb 2025 05:18:23 GMT
- Title: Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models
- Authors: Yanggan Gu, Junzhuo Li, Sirui Huang, Xin Zou, Zhenghua Li, Xuming Hu
- Abstract summary: We propose a Preference-Aligned Distillation framework, which models the teacher's preference knowledge as a probability distribution over all potential preferences.
Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches.
- Score: 22.613040767122225
- Abstract: Aligning small language models (SLMs) with human values typically involves distilling preference knowledge from large language models (LLMs). However, existing distillation methods model preference knowledge in teacher LLMs by comparing pairwise responses, overlooking the extent of difference between responses. This limitation hinders student SLMs from capturing the nuanced preferences for multiple responses. In this paper, we propose a Preference-Aligned Distillation (PAD) framework, which models the teacher's preference knowledge as a probability distribution over all potential preferences, thereby providing more nuanced supervisory signals. Our insight in developing PAD is rooted in the demonstration that language models can serve as reward functions, reflecting their intrinsic preferences. Based on this, PAD comprises three key steps: (1) sampling diverse responses at high temperature; (2) computing rewards for both teacher and student to construct their intrinsic preferences; and (3) training the student's intrinsic preference distribution to align with the teacher's. Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches, achieving over 20% improvement on AlpacaEval 2 and Arena-Hard, indicating superior alignment with human preferences. Notably, on MT-Bench, using the Gemma model family, the student trained by PAD surpasses its teacher, further validating the effectiveness of PAD.
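The three steps above translate directly into a training objective. Below is a minimal, illustrative sketch, assuming Hugging Face-style model and tokenizer interfaces, that the intrinsic reward is the length-normalized log-likelihood of a response under each model, and that the preference distribution is a softmax over those rewards; the helper names and the exact reward definition are assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def seq_logprob(model, tok, prompt, response):
    """Length-normalized log-likelihood of `response` given `prompt`,
    used here as the assumed intrinsic reward r(x, y)."""
    full = tok(prompt + response, return_tensors="pt").input_ids
    p_len = tok(prompt, return_tensors="pt").input_ids.size(1)
    logits = model(full).logits
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    tok_lp = logp.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
    return tok_lp[:, p_len - 1:].mean(-1)                # shape (1,)

def pad_loss(teacher, student, tok, prompt, responses, tau=1.0):
    """KL(teacher preferences || student preferences) over K responses,
    where `responses` come from step (1), high-temperature sampling."""
    with torch.no_grad():                                # teacher is frozen
        r_t = torch.cat([seq_logprob(teacher, tok, prompt, y) for y in responses])
    r_s = torch.cat([seq_logprob(student, tok, prompt, y) for y in responses])
    p_t = F.softmax(r_t / tau, dim=-1)                   # teacher preference distribution
    log_p_s = F.log_softmax(r_s / tau, dim=-1)           # student preference distribution
    return F.kl_div(log_p_s, p_t, reduction="sum")       # step (3): align the distributions
```

In this reading, step (2) corresponds to the two reward vectors `r_t` and `r_s`, and minimizing the KL term trains the student's intrinsic preference distribution toward the teacher's.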
Related papers
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions.
Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
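A rough sketch of one self-play round under this description follows; the `judge` and `tree_search_refine` helpers and the pairing of refined vs. original responses are assumptions about the procedure, not the paper's code.

```python
def spar_round(generate, judge, tree_search_refine, instructions):
    """One SPaR-style self-play round (sketch): the model critiques its own
    outputs and searches for minimally edited refinements, yielding preference
    pairs that differ mainly in instruction compliance."""
    pairs = []
    for inst in instructions:
        response = generate(inst)
        if judge(inst, response):                  # already follows the instruction
            continue
        refined = tree_search_refine(inst, response)
        pairs.append({"prompt": inst, "chosen": refined, "rejected": response})
    return pairs   # used for preference optimization in the next iteration
```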
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
- Joint Training for Selective Prediction [5.662924503089369]
Selective Prediction methods determine when to adopt a classifier's output versus defer to a human.
One previous method involves learning a deferral model based on engineered features.
We introduce a novel joint-training approach that simultaneously optimizes the learned representations used by the classifier module and a learned deferral policy.
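One plausible shape for such joint training is sketched below, with an assumed shared encoder and a simple surrogate objective; the architecture and the deferral cost are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSelectiveModel(nn.Module):
    """Shared encoder feeding both a classifier head and a learned deferral policy."""
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_classes)
        self.deferral = nn.Linear(hidden, 1)             # P(defer to a human)

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), torch.sigmoid(self.deferral(h)).squeeze(-1)

def joint_loss(logits, defer_prob, labels, defer_cost=0.3):
    """Pay the classifier's loss when keeping its output, a fixed cost when deferring."""
    ce = F.cross_entropy(logits, labels, reduction="none")
    return ((1.0 - defer_prob) * ce + defer_prob * defer_cost).mean()
```

Because the encoder receives gradients from both heads, the representations are shaped jointly by classification accuracy and by when deferral is worthwhile.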
arXiv Detail & Related papers (2024-10-31T15:28:26Z)
- Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences [12.775486996512434]
Preference-Based Reinforcement Learning (PBRL) learns directly from the preferences of human teachers regarding agent behaviors.
Existing PBRL methods often learn from explicit preferences, neglecting the possibility that teachers may choose equal preferences.
We propose a novel PBRL method, Multi-Type Preference Learning (MTPL), which allows simultaneous learning from equal preferences while leveraging existing methods for learning from explicit preferences.
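Handling equal preferences can be expressed as a soft target in an otherwise standard pairwise preference loss; the snippet below is an assumed Bradley-Terry-style formulation, not necessarily the exact MTPL objective.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_a, reward_b, label):
    """Pairwise preference loss over two behavior segments.
    label = 1.0 if A is preferred, 0.0 if B is preferred, 0.5 if equally preferred."""
    p_a = torch.sigmoid(reward_a - reward_b)       # P(A preferred over B)
    return F.binary_cross_entropy(p_a, label)      # soft targets cover equal preferences

# explicit preference: the teacher prefers segment A
loss_explicit = preference_loss(torch.tensor([1.3]), torch.tensor([0.2]), torch.tensor([1.0]))
# equal preference: pushes the two predicted rewards toward each other
loss_equal = preference_loss(torch.tensor([1.3]), torch.tensor([0.2]), torch.tensor([0.5]))
```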
arXiv Detail & Related papers (2024-09-11T13:43:49Z)
- Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models [7.632217365130212]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various machine learning (ML) tasks.
These models can produce hallucinations, particularly in domains with incomplete knowledge.
We introduce DualChecker, an innovative framework designed to mitigate hallucinations and improve the performance of both teacher and student models.
arXiv Detail & Related papers (2024-08-22T12:04:04Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
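At its core, a reward model of this kind is trained to rank the chosen response above the rejected one; the sketch below shows that plain pairwise ranking objective, with the paper's additional contrastive and noise-mitigation components left out as details beyond this summary.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Push the reward of the chosen response above that of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# example: rewards predicted by the reward model for a batch of response pairs
loss = reward_ranking_loss(torch.tensor([1.2, 0.4]), torch.tensor([0.1, 0.9]))
```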
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment [121.45689748315125]
Reinforcement Learning from Contrastive Distillation (RLCD) is a method for aligning language models without using human feedback.
RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them.
We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning.
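The pair-construction step can be sketched in a few lines; the prompt wording and the `generate` interface below are assumptions, and the downstream preference-model training and RL fine-tuning are only indicated in comments.

```python
# Assumed principle-following / principle-violating system prompts.
POSITIVE = "Respond as helpfully and harmlessly as possible.\n"
NEGATIVE = "Respond unhelpfully and harmfully.\n"

def build_preference_pair(generate, instruction):
    """Return a (chosen, rejected) pair from two contrasting prompts."""
    chosen = generate(POSITIVE + instruction)     # encouraged to follow the principles
    rejected = generate(NEGATIVE + instruction)   # encouraged to violate them
    return chosen, rejected

# The resulting pairs train a preference model, which then scores rollouts while
# a base (unaligned) model is improved with reinforcement learning (e.g. PPO).
```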
arXiv Detail & Related papers (2023-07-24T17:23:22Z)
- Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements.
We show theoretically that ILF can be viewed as Bayesian inference, similar to Reinforcement Learning from Human Feedback (RLHF).
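One ILF round, as described here, can be sketched as refinement generation followed by supervised fine-tuning on the best refinements; the prompt template, the selection rule, and all helper callables are assumed interfaces rather than the paper's code.

```python
def ilf_iteration(lm_generate, lm_finetune, score_refinement, examples, k=4):
    """One ILF round (sketch): condition on input + initial output + feedback,
    sample k refinements, keep the best, then fine-tune on the kept refinements."""
    refinements = []
    for ex in examples:
        prompt = (f"Input: {ex['input']}\n"
                  f"Initial output: {ex['output']}\n"
                  f"Feedback: {ex['feedback']}\n"
                  f"Refined output:")
        candidates = [lm_generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda c: score_refinement(ex, c))
        refinements.append({"input": ex["input"], "target": best})
    lm_finetune(refinements)   # supervised fine-tuning on the selected refinements
```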
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
- SKDBERT: Compressing BERT via Stochastic Knowledge Distillation [17.589678394344475]
We propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT.
In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into the student model in a one-to-one manner.
Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language-understanding performance and being 100% faster.
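A single stochastic-distillation step under this description might look as follows; the sampling distribution over teacher capacities and the Hugging Face-style model interface are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def skd_step(student, teachers, probs, batch, T=2.0):
    """Sample one teacher from the ensemble, then distil from it one-to-one."""
    teacher = random.choices(teachers, weights=probs, k=1)[0]
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    return F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
```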
arXiv Detail & Related papers (2022-11-26T03:18:55Z)
- On the Role of Bidirectionality in Language Model Pre-Training [85.14614350372004]
We study the role of bidirectionality in next token prediction, text infilling, zero-shot priming and fine-tuning.
We train models with up to 6.7B parameters, and find differences to remain consistent at scale.
arXiv Detail & Related papers (2022-05-24T02:25:05Z)
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both the pre-training and fine-tuning stages, and achieves superb performance on the GLUE benchmark.
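The contrastive objective over intermediate representations can be sketched as an InfoNCE-style loss on pooled hidden states; the pooling, projection, and negative-sampling details below are assumptions, not CoDIR's exact configuration.

```python
import torch
import torch.nn.functional as F

def codir_layer_loss(student_h, teacher_h_pos, teacher_h_negs, tau=0.1):
    """Match the student's pooled hidden state to the teacher's state for the
    same input (positive) rather than states from other inputs (negatives)."""
    s = F.normalize(student_h, dim=-1)              # (d,)
    pos = F.normalize(teacher_h_pos, dim=-1)        # (d,)
    negs = F.normalize(teacher_h_negs, dim=-1)      # (N, d)
    logits = torch.cat([(s * pos).sum(-1, keepdim=True), negs @ s]) / tau
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.zeros(1, dtype=torch.long))   # positive sits at index 0
```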
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.