Self-Rewarding Language Models
- URL: http://arxiv.org/abs/2401.10020v2
- Date: Thu, 8 Feb 2024 10:19:53 GMT
- Title: Self-Rewarding Language Models
- Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar
Sukhbaatar, Jing Xu, Jason Weston
- Abstract summary: We study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training, not only does instruction following ability improve, but so does the ability to provide high-quality rewards to itself.
- Score: 105.6830788170348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We posit that to achieve superhuman agents, future models require superhuman
feedback in order to provide an adequate training signal. Current approaches
commonly train reward models from human preferences, which may then be
bottlenecked by human performance level, and secondly these separate frozen
reward models cannot then learn to improve during LLM training. In this work,
we study Self-Rewarding Language Models, where the language model itself is
used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training, not only does instruction following
ability improve, but so does the ability to provide high-quality rewards
to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a
model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard,
including Claude 2, Gemini Pro, and GPT-4 0613. While there is still much left
to explore, this work opens the door to the possibility of models that can
continually improve in both axes.
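As a concrete reading of the loop described above, the minimal Python sketch below generates several candidate responses per prompt, scores each with an LLM-as-a-Judge style call, turns the highest- and lowest-scored candidates into a DPO preference pair, and repeats for a few iterations. The ToyModel class, its generate/judge/dpo_update methods, and the random scoring are placeholders for illustration, not the paper's implementation, which fine-tunes Llama 2 70B with a 5-point additive judging rubric and standard DPO training.

```python
# Minimal sketch of Self-Rewarding training; ToyModel and its methods are
# hypothetical placeholders, not the paper's implementation.
import random


class ToyModel:
    """Stand-in for the instruction-following model (Llama 2 70B in the paper)."""

    def generate(self, prompt: str) -> str:
        return f"candidate-{random.randint(0, 9999)} for: {prompt}"

    def judge(self, prompt: str, response: str) -> float:
        # LLM-as-a-Judge: the same model would score its own response on a
        # 0-5 additive rubric; here a random placeholder score is returned.
        return random.uniform(0.0, 5.0)

    def dpo_update(self, preference_pairs) -> None:
        # Placeholder for one round of DPO training on (prompt, chosen, rejected).
        print(f"DPO update on {len(preference_pairs)} preference pairs")


def self_rewarding_iteration(model: ToyModel, prompts, n_candidates: int = 4) -> ToyModel:
    pairs = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda resp: model.judge(prompt, resp))
        # Highest-scored candidate becomes "chosen", lowest becomes "rejected".
        pairs.append((prompt, ranked[-1], ranked[0]))
    model.dpo_update(pairs)
    return model


if __name__ == "__main__":
    model = ToyModel()
    for _ in range(3):  # the paper reports three such iterations
        model = self_rewarding_iteration(model, ["Explain photosynthesis.", "Write a haiku."])
```

Because the same weights play both the generator and judge roles, each DPO round can improve both generation and reward quality, which is the "both axes" claim in the abstract.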
Related papers
- Self-Evolved Reward Learning for LLMs [45.6910747154447]
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences.
We propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself (a generic self-training sketch follows this entry).
Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance.
arXiv Detail & Related papers (2024-11-01T07:29:03Z)
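The SER summary above does not spell out its pipeline, so the sketch below shows a generic reward-model self-training loop in that spirit: the RM pseudo-labels unlabeled response pairs, keeps only labels it is confident about, and is retrained on them. The function names, confidence margin, and round count are assumptions for illustration, not SER's actual recipe.

```python
# Generic reward-model (RM) self-training sketch; hypothetical, not necessarily
# SER's exact pipeline. The RM pseudo-labels unlabeled response pairs, and only
# confident labels are kept as new training data for the next round.
from typing import Callable, List, Tuple

ScoreFn = Callable[[str, str], float]   # (prompt, response) -> scalar reward
Pair = Tuple[str, str, str]             # (prompt, response_a, response_b)


def self_train_rm(score: ScoreFn,
                  retrain: Callable[[List[Tuple[str, str, str]]], ScoreFn],
                  unlabeled: List[Pair],
                  margin: float = 0.5,
                  rounds: int = 3) -> ScoreFn:
    for _ in range(rounds):
        pseudo_labeled = []
        for prompt, a, b in unlabeled:
            score_a, score_b = score(prompt, a), score(prompt, b)
            if abs(score_a - score_b) >= margin:  # keep only confident self-labels
                chosen, rejected = (a, b) if score_a > score_b else (b, a)
                pseudo_labeled.append((prompt, chosen, rejected))
        score = retrain(pseudo_labeled)           # placeholder RM fine-tuning step
    return score
```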
- On Designing Effective RL Reward at Training Time for LLM Reasoning [14.006845442313134]
We evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM).
Surprisingly, even though these learned reward models have strong inference-time performances, they may NOT help or even hurt RL training.
We introduce two novel reward refinement techniques, Clipping and Delta (illustrated after this entry).
arXiv Detail & Related papers (2024-10-19T13:53:50Z)
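The entry above names Clipping and Delta without defining them; the sketch below gives one plausible reading: clipping caps per-step rewards at an upper bound to blunt reward hacking, while a delta-style refinement replaces each step's reward with the difference between adjacent steps so that the summed reward telescopes. Treat the exact forms as assumptions rather than the paper's definitions.

```python
# Illustrative reward refinements: one plausible reading of "Clipping" and
# "Delta"; the paper's exact formulations may differ.
from typing import List


def clip_rewards(step_rewards: List[float], upper: float = 1.0) -> List[float]:
    """Clipping: cap each per-step reward at an upper bound to limit reward hacking."""
    return [min(r, upper) for r in step_rewards]


def delta_rewards(step_rewards: List[float]) -> List[float]:
    """Delta: replace step t's reward with r_t - r_{t+1}, so the summed reward
    telescopes and cannot be inflated simply by adding more steps."""
    shaped = []
    for t, r in enumerate(step_rewards):
        nxt = step_rewards[t + 1] if t + 1 < len(step_rewards) else 0.0
        shaped.append(r - nxt)
    return shaped


if __name__ == "__main__":
    rewards = [0.2, 0.9, 3.5]   # per-step rewards, e.g. from a PRM
    print(clip_rewards(rewards), delta_rewards(rewards))
```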
- The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models [18.64902083536956]
We show that language models trained with moderately accurate reward models outperform those guided by highly accurate ones.
This challenges the widely held belief that stronger reward models always lead to better language models.
arXiv Detail & Related papers (2024-10-09T05:17:08Z)
- Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge [77.9094410773789]
Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains.
Recent self-rewarding mechanisms have shown that LLMs can improve by judging their own responses instead of relying on human labelers.
We introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgments and uses that feedback to refine its judgment skills (a sketch of the data flow follows this entry).
arXiv Detail & Related papers (2024-07-28T21:58:28Z)
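One way to read the Meta-Rewarding step above is as a data pipeline: the model answers a prompt, samples several written judgments of that answer, meta-judges pairs of those judgments, and the preferred judgments become preference pairs for training the judging ability. The callables, prompt formatting, and pairing scheme below are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of a Meta-Rewarding-style data flow; the callables and formatting are
# placeholders rather than the paper's prompts and filtering rules.
import itertools
from typing import Callable, List, Tuple

Generate = Callable[[str], str]                  # prompt -> response
Judge = Callable[[str, str], str]                # (prompt, response) -> written judgment
MetaJudge = Callable[[str, str, str, str], int]  # (prompt, response, judgment_a, judgment_b) -> 0 or 1


def build_judge_preferences(generate: Generate, judge: Judge, meta_judge: MetaJudge,
                            prompts: List[str], n_judgments: int = 4
                            ) -> List[Tuple[str, str, str]]:
    """Return (context, chosen_judgment, rejected_judgment) pairs for training the judge."""
    pairs = []
    for prompt in prompts:
        response = generate(prompt)
        judgments = [judge(prompt, response) for _ in range(n_judgments)]  # sampled judgments
        for j_a, j_b in itertools.combinations(judgments, 2):
            winner = meta_judge(prompt, response, j_a, j_b)  # the model judges its own judgments
            chosen, rejected = (j_a, j_b) if winner == 0 else (j_b, j_a)
            pairs.append((f"{prompt}\n{response}", chosen, rejected))
    return pairs  # these pairs can then be fed to DPO to refine the judging skill
```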
- Bootstrapping Language Models with DPO Implicit Rewards [45.68366127605774]
Direct preference optimization (DPO) has greatly simplified the alignment process compared to past work on reinforcement learning from human feedback, and it implicitly defines a reward model.
In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM (the implicit reward is sketched after this entry).
Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance.
arXiv Detail & Related papers (2024-06-14T06:57:18Z)
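The implicit reward that DICE builds on is the standard DPO quantity r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)), up to a term that depends only on the prompt. The sketch below computes it from log-probability callables (placeholders here) and uses it to rank fresh samples into chosen/rejected pairs; how DICE filters or reweights such pairs is not reproduced.

```python
# DPO's implicit reward: r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
# (up to a prompt-only term). A bootstrapping step in the spirit of DICE can
# rank fresh samples with this reward to form new DPO preference pairs.
from typing import Callable, List, Tuple

LogProb = Callable[[str, str], float]   # (prompt, response) -> log-probability of response


def implicit_reward(logp_policy: LogProb, logp_ref: LogProb,
                    prompt: str, response: str, beta: float = 0.1) -> float:
    return beta * (logp_policy(prompt, response) - logp_ref(prompt, response))


def build_pair(logp_policy: LogProb, logp_ref: LogProb,
               prompt: str, candidates: List[str], beta: float = 0.1
               ) -> Tuple[str, str]:
    """Return (chosen, rejected): candidates with the highest / lowest implicit reward."""
    ranked = sorted(candidates,
                    key=lambda y: implicit_reward(logp_policy, logp_ref, prompt, y, beta))
    return ranked[-1], ranked[0]
```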
- Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preference.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO: it amplifies the reward signal learned during alignment training (a weight-extrapolation sketch follows this entry).
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
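ExPO operates in weight space: starting from a weaker (e.g. SFT) checkpoint and an aligned DPO/RLHF checkpoint, it extrapolates further along the weak-to-strong direction. The sketch below assumes the form θ_expo = θ_aligned + α · (θ_aligned − θ_sft); the coefficient convention, how α is chosen, and the toy dict-of-lists weight format are assumptions for illustration.

```python
# Weight extrapolation in the spirit of ExPO (coefficient convention assumed):
# theta_expo = theta_aligned + alpha * (theta_aligned - theta_sft), alpha > 0,
# i.e. step further along the weak-to-strong direction in weight space.
from typing import Dict, List

Weights = Dict[str, List[float]]   # parameter name -> flat values (toy stand-in for tensors)


def expo_extrapolate(sft: Weights, aligned: Weights, alpha: float = 0.3) -> Weights:
    return {
        name: [a + alpha * (a - s) for s, a in zip(sft[name], aligned[name])]
        for name in aligned
    }


if __name__ == "__main__":
    sft = {"layer.w": [0.0, 1.0]}
    aligned = {"layer.w": [0.2, 0.8]}
    print(expo_extrapolate(sft, aligned))   # moves past `aligned`, away from `sft`
```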
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses (the pairwise objective such models build on is sketched after this entry).
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
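For context on what "distinguish between chosen and rejected responses" means formally, the sketch below shows the standard pairwise reward-modeling objective such work starts from, −log σ(r_chosen − r_rejected); the paper's specific contrastive additions and preference-denoising methods are not reproduced here.

```python
# Standard pairwise reward-modeling objective that such work builds on:
# loss = -log sigmoid(r_chosen - r_rejected). The paper's contrastive and
# denoising additions are not reproduced here.
import math
from typing import List, Tuple


def pairwise_rm_loss(reward_pairs: List[Tuple[float, float]]) -> float:
    """reward_pairs: (r_chosen, r_rejected) scores assigned by the reward model."""
    total = 0.0
    for r_chosen, r_rejected in reward_pairs:
        total += -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
    return total / len(reward_pairs)


if __name__ == "__main__":
    # Second pair is mis-ranked (rejected scored above chosen), so the loss rises.
    print(pairwise_rm_loss([(1.2, -0.3), (0.1, 0.4)]))
```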
- Ensembling Off-the-shelf Models for GAN Training [55.34705213104182]
We find that pretrained computer vision models can significantly improve performance when used in an ensemble of discriminators.
We propose an effective selection mechanism: probing the linear separability between real and fake samples in pretrained model embeddings (sketched after this entry).
Our method can improve GAN training in both limited data and large-scale settings.
arXiv Detail & Related papers (2021-12-16T18:59:50Z)
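The selection mechanism in the GAN entry above can be illustrated with a small probe: embed real and fake samples with each candidate pretrained model and rank the models by how well a linear classifier separates the two sets. The toy Gaussian embeddings and the use of training accuracy as the separability score below are assumptions for illustration, not the paper's implementation.

```python
# Probe-based selection of pretrained feature extractors: rank embedding spaces
# by how linearly separable real vs. fake samples are under a simple
# logistic-regression probe. Toy data stands in for real image embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression


def probe_separability(real_emb: np.ndarray, fake_emb: np.ndarray) -> float:
    X = np.vstack([real_emb, fake_emb])
    y = np.concatenate([np.ones(len(real_emb)), np.zeros(len(fake_emb))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return probe.score(X, y)   # training accuracy as a cheap separability proxy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 64-dim embeddings from two candidate pretrained models.
    candidates = {
        "model_a": (rng.normal(0.0, 1.0, (200, 64)), rng.normal(0.1, 1.0, (200, 64))),
        "model_b": (rng.normal(0.0, 1.0, (200, 64)), rng.normal(1.5, 1.0, (200, 64))),
    }
    scores = {name: probe_separability(r, f) for name, (r, f) in candidates.items()}
    print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))  # most separable first
```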
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.