Related papers: Critique-out-Loud Reward Models

URL: http://arxiv.org/abs/2408.11791v1
Date: Wed, 21 Aug 2024 17:24:15 GMT
Title: Critique-out-Loud Reward Models
Authors: Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, Prithviraj Ammanabrolu
Abstract summary: We introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models.
Score: 20.631830494414096
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models, CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models, respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.
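Illustrative sketch (not from the paper): the abstract describes a two-step scoring flow, critique first and scalar reward second, which can also serve as the scorer for Best-of-N. The Python below is a minimal sketch of that flow under stated assumptions. The names `cloud_reward`, `best_of_n`, `generate_critique`, and `reward_head` are hypothetical, the two callables stand in for the trained model, and averaging rewards over several sampled critiques is only one plausible reading of "self-consistency decoding for reward prediction".

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions for illustration, not the authors' API).
CritiqueFn = Callable[[str, str], str]        # (prompt, response) -> critique text
RewardFn = Callable[[str, str, str], float]   # (prompt, response, critique) -> reward


def cloud_reward(
    prompt: str,
    response: str,
    generate_critique: CritiqueFn,
    reward_head: RewardFn,
    num_critiques: int = 1,
) -> float:
    """Score a response by critiquing it first, then predicting a scalar reward.

    With num_critiques > 1, one reward is predicted per sampled critique and
    the rewards are averaged (an assumed form of self-consistency).
    """
    if num_critiques < 1:
        raise ValueError("num_critiques must be at least 1")
    rewards = []
    for _ in range(num_critiques):
        critique = generate_critique(prompt, response)
        rewards.append(reward_head(prompt, response, critique))
    return mean(rewards)


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate response with the highest score for the prompt."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda response: score(prompt, response))
```

In use, `generate_critique` would sample text from the language model and `reward_head` would read a scalar from the reward head conditioned on that critique. Passing `lambda p, r: cloud_reward(p, r, generate_critique, reward_head, num_critiques=4)` as `score` to `best_of_n` spends more inference compute per candidate in exchange for a less noisy reward estimate.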

Related papers

Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference [27.205035058481553]
We propose assigning scores to every sentence, introducing an intermediate-grained reward model. A novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score. Our method outperforms the response-level reward model by 2.7% on RewardBench.
arXiv Detail & Related papers (2025-03-01T14:11:04Z)
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision. Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z)
Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives [14.401557416713315]
We revisit the foundations of using Bradley-Terry (BT) models in reward modeling. We argue that the BT model is not a necessary choice from the perspective of downstream optimization. We propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers.
arXiv Detail & Related papers (2024-11-07T18:57:03Z)
CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling [27.86204841898399]
Reward modeling in large language models is susceptible to reward hacking. We propose Context-Aware Reward Modeling (CARMO) to mitigate this problem. We establish a new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on RewardBench.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [37.97757796124621]
RM-Bench is a novel benchmark designed to evaluate reward models based on their sensitivity to subtle content differences and resistance to style biases. We evaluate nearly 40 reward models on RM-Bench and find that even state-of-the-art models achieve an average performance of only 46.6%.
arXiv Detail & Related papers (2024-10-21T16:48:26Z)
The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models [18.64902083536956]
We show that language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models.
arXiv Detail & Related papers (2024-10-09T05:17:08Z)
Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment [50.21842377409232]
Despite the vital role reward models play in alignment, previous works have consistently overlooked their performance. This work first investigates the quality of the widely used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF. Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them both for optimization and evaluation.
arXiv Detail & Related papers (2024-09-26T04:28:35Z)
Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
HAF-RM: A Hybrid Alignment Framework for Reward Model Training [51.59246299566669]
We propose a hybrid alignment framework HaF-RM for reward model training. It offers a principled and effective approach to enhancing the performance and alignment of reward models.
arXiv Detail & Related papers (2024-07-04T23:26:56Z)
RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models. The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset. We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.