Critique-out-Loud Reward Models
- URL: http://arxiv.org/abs/2408.11791v1
- Date: Wed, 21 Aug 2024 17:24:15 GMT
- Title: Critique-out-Loud Reward Models
- Authors: Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, Prithviraj Ammanabrolu,
- Abstract summary: We introduce Critique-out-Loud (CLoud) reward models.
CLoud reward models operate by first generating a natural language critique of the assistant's response.
We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models.
- Score: 20.631830494414096
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.
Related papers
- Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision.
Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z) - Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives [14.401557416713315]
We revisit the foundations of using Bradley-Terry (BT) models in reward modeling.
We argue that the BT model is not a necessary choice from the perspective of downstream optimization.
We propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers.
arXiv Detail & Related papers (2024-11-07T18:57:03Z) - RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [37.97757796124621]
RM-Bench is a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases.
We evaluate nearly 40 reward models on RM-Bench and find that even state-of-the-art models achieve an average performance of only 46.6%.
arXiv Detail & Related papers (2024-10-21T16:48:26Z) - The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models [18.64902083536956]
We show that language models trained with moderately accurate reward models outperform those guided by highly accurate ones.
This challenges the widely held belief that stronger reward models always lead to better language models.
arXiv Detail & Related papers (2024-10-09T05:17:08Z) - Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment [50.21842377409232]
Despite vital role reward models play in alignment, previous works have consistently overlooked their performance.
This work first investigates the quality of the widely-used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF.
Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them both for optimization and evaluation.
arXiv Detail & Related papers (2024-09-26T04:28:35Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - HAF-RM: A Hybrid Alignment Framework for Reward Model Training [51.59246299566669]
We propose a hybrid alignment framework HaF-RM for reward model training.
It offers a principled and effective approach to enhancing the performance and alignment of reward models.
arXiv Detail & Related papers (2024-07-04T23:26:56Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.