Zephyr: Direct Distillation of LM Alignment
- URL: http://arxiv.org/abs/2310.16944v1
- Date: Wed, 25 Oct 2023 19:25:16 GMT
- Title: Zephyr: Direct Distillation of LM Alignment
- Authors: Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani,
Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine
Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush,
and Thomas Wolf
- Abstract summary: We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy.
We apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment.
- Score: 59.03530095974505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning
(dSFT) on larger models significantly improves task accuracy; however, these
models are unaligned, i.e. they do not respond well to natural prompts. To
distill this property, we experiment with the use of preference data from AI
Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model,
we apply distilled direct preference optimization (dDPO) to learn a chat model
with significantly improved intent alignment. The approach requires only a few
hours of training without any additional sampling during fine-tuning. The final
result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B
parameter models, and requires no human annotation. In particular, results on
MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access
RLHF-based model. Code, models, data, and tutorials for the system are
available at https://github.com/huggingface/alignment-handbook.
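The dDPO step optimizes the standard DPO objective on teacher-ranked (AI Feedback) preference pairs, using the dSFT model as the frozen reference policy. Below is a minimal sketch of that loss in PyTorch with hypothetical tensor names; the released training code lives in the alignment-handbook repository linked above.

```python
# Minimal sketch of the dDPO preference objective (standard DPO loss),
# assuming per-sequence log-probabilities have already been computed.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x) from the frozen dSFT model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x) from the frozen dSFT model
    beta: float = 0.1,                    # temperature controlling deviation from the reference
) -> torch.Tensor:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)), averaged over the batch."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Completions ranked higher by the teacher (AI Feedback) should receive a
    # larger implicit reward than the rejected completions.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


if __name__ == "__main__":
    # Random stand-ins for summed token log-probs of chosen/rejected completions.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

Note that, unlike RLHF with PPO, this objective needs no reward model and no sampling during fine-tuning, which is why the full dDPO stage fits in a few hours of training.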
Related papers
- HelpSteer2-Preference: Complementing Ratings with Preferences [45.01567242039055]
Reward models are critical for aligning models to follow instructions.
There is a lack of evidence that either Bradley-Terry-style or regression-style reward modeling is better than the other when adequately matched for data.
We propose a novel approach to combine Bradley-Terry and Regression reward modeling.
arXiv Detail & Related papers (2024-10-02T06:05:52Z)
- Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z)
- Preference Alignment with Flow Matching [23.042382086241364]
Preference Flow Matching (PFM) is a new framework for preference-based reinforcement learning (PbRL).
It streamlines the integration of preferences into an arbitrary class of pre-trained models.
We provide theoretical insights that support our method's alignment with standard PbRL objectives.
arXiv Detail & Related papers (2024-05-30T08:16:22Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
By contrast, advanced LLMs like GPT-4-Turbo place greater emphasis on correctness, clarity, and harmlessness.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.