Critique-out-Loud Reward Models
- URL: http://arxiv.org/abs/2408.11791v1
- Date: Wed, 21 Aug 2024 17:24:15 GMT
- Title: Critique-out-Loud Reward Models
- Authors: Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, Prithviraj Ammanabrolu,
- Abstract summary: We introduce Critique-out-Loud (CLoud) reward models.
CLoud reward models operate by first generating a natural language critique of the assistant's response.
We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models.
- Score: 20.631830494414096
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.
Related papers
- Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision.
Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z) - Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives [14.401557416713315]
We revisit the foundations of using Bradley-Terry (BT) models in reward modeling.
We argue that the BT model is not a necessary choice from the perspective of downstream optimization.
We propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers.
arXiv Detail & Related papers (2024-11-07T18:57:03Z) - CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling [27.86204841898399]
Reward modeling in large language models is susceptible to reward hacking.
We propose Context-Aware Reward Modeling (CARMO) to mitigate this problem.
We establish a new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on Reward Bench.
arXiv Detail & Related papers (2024-10-28T21:18:49Z) - RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [37.97757796124621]
RM-Bench is a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases.
We evaluate nearly 40 reward models on RM-Bench and find that even state-of-the-art models achieve an average performance of only 46.6%.
arXiv Detail & Related papers (2024-10-21T16:48:26Z) - Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently.
We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.
Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z) - Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment [50.21842377409232]
Despite vital role reward models play in alignment, previous works have consistently overlooked their performance.
This work first investigates the quality of the widely-used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF.
Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them both for optimization and evaluation.
arXiv Detail & Related papers (2024-09-26T04:28:35Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - HAF-RM: A Hybrid Alignment Framework for Reward Model Training [51.59246299566669]
We propose a hybrid alignment framework HaF-RM for reward model training.
It offers a principled and effective approach to enhancing the performance and alignment of reward models.
arXiv Detail & Related papers (2024-07-04T23:26:56Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.