Zephyr: Direct Distillation of LM Alignment
- URL: http://arxiv.org/abs/2310.16944v1
- Date: Wed, 25 Oct 2023 19:25:16 GMT
- Title: Zephyr: Direct Distillation of LM Alignment
- Authors: Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani,
Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine
Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush,
and Thomas Wolf
- Abstract summary: We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy.
We apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment.
- Score: 59.03530095974505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning
(dSFT) on larger models significantly improves task accuracy; however, these
models are unaligned, i.e. they do not respond well to natural prompts. To
distill this property, we experiment with the use of preference data from AI
Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model,
we apply distilled direct preference optimization (dDPO) to learn a chat model
with significantly improved intent alignment. The approach requires only a few
hours of training without any additional sampling during fine-tuning. The final
result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B
parameter models, and requires no human annotation. In particular, results on
MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access
RLHF-based model. Code, models, data, and tutorials for the system are
available at https://github.com/huggingface/alignment-handbook.
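The dDPO step optimizes the standard direct preference optimization objective on AI-feedback preference pairs, with the dSFT model as the frozen reference. Below is a minimal sketch of that loss, assuming per-completion log-probabilities have already been computed; tensor names and the beta value are illustrative, and the alignment-handbook repository contains the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) completion pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for a
    completion under either the policy being trained or the frozen dSFT
    reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between preferred and dispreferred completions.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()
```

In practice the loss is back-propagated through the policy only; the reference log-probabilities are treated as constants, which is why no additional sampling is needed during fine-tuning.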
Related papers
- Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations.
Current approaches focus on using reinforcement learning with human feedback to improve model alignment.
We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z)
- Preference Alignment with Flow Matching [23.042382086241364]
Preference Flow Matching (PFM) is a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models.
We provide theoretical insights that support our method's alignment with standard PbRL objectives.
arXiv Detail & Related papers (2024-05-30T08:16:22Z)
- Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We show that most preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors.
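Ranking accuracy here refers to the fraction of preference pairs on which the model assigns a higher log-likelihood to the chosen response than to the rejected one. A minimal sketch of that metric, assuming per-response log-probabilities are already available (names are illustrative, not the paper's code):

```python
import torch

def ranking_accuracy(chosen_logps: torch.Tensor,
                     rejected_logps: torch.Tensor) -> float:
    """Fraction of preference pairs on which the model ranks the chosen response higher."""
    # A pair counts as correct when the chosen completion gets the higher log-likelihood.
    return (chosen_logps > rejected_logps).float().mean().item()
```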
arXiv Detail & Related papers (2024-05-29T21:29:44Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [90.4820014819937]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages:
1) Supervised fine-tuning (SFT), where the model is fine-tuned on human demonstration data.
2) Preference learning, where preference data is used to learn a reward model, which a reinforcement learning step then uses to fine-tune the model (a sketch of the pairwise reward loss follows below).
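A minimal sketch of the pairwise (Bradley-Terry) loss typically used to fit the reward model in stage 2; the names are illustrative and this is not the paper's own code:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(rewards_chosen: torch.Tensor,
                         rewards_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward-model training.

    Both arguments are 1-D tensors of scalar rewards produced by the reward
    head for the preferred and dispreferred responses of each pair.
    """
    # Maximize the probability that the preferred response scores higher.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```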
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- The Entropy Enigma: Success and Failure of Entropy Minimization [30.083332640328642]
Entropy minimization (EM) is frequently used to increase the accuracy of classification models when they're faced with new data at test time.
We analyze why EM works when adapting a model for a few steps and why it eventually fails after adapting for many steps.
We present a method for solving a practical problem: estimating a model's accuracy on a given arbitrary dataset without having access to its labels.
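Test-time entropy minimization uses the Shannon entropy of the model's own predictive distribution as the adaptation loss on unlabeled data. A minimal sketch of that objective (names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the predicted class distribution for a batch of logits."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Minimizing this at test time makes predictions more confident on unlabeled data.
    return -(probs * log_probs).sum(dim=-1).mean()
```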
arXiv Detail & Related papers (2024-05-08T12:26:15Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
Advanced LLMs like GPT-4-Turbo, by contrast, place greater emphasis on correctness, clarity, and harmlessness.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.