Related papers: Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

URL: http://arxiv.org/abs/2312.09244v3
Date: Fri, 16 Aug 2024 23:59:29 GMT
Title: Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Authors: Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
Abstract summary: Reward models play a key role in aligning language model applications towards human preferences. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
Score: 62.146953368613815
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed reward hacking. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are underspecified: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. Third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their pretraining seeds lead to better generalization than ensembles that differ only by their fine-tuning seeds, with both outperforming individual reward models. However, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
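The inference-time use described in the abstract (aggregating ensemble outputs, then reranking candidates) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `RewardModel` callable type, the three aggregators ("mean", "min", "mean_minus_std"), and the two toy word-count reward models `rm_a` and `rm_b`. A real ensemble would consist of trained reward models.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence

# Hypothetical interface: a reward model maps (prompt, response) to a scalar.
RewardModel = Callable[[str, str], float]


def ensemble_reward(models: Sequence[RewardModel], prompt: str,
                    response: str, aggregate: str = "mean") -> float:
    """Combine the scores of all ensemble members into one reward estimate."""
    scores = [m(prompt, response) for m in models]
    if aggregate == "mean":
        return mean(scores)
    if aggregate == "min":
        # Conservative: trust only the most pessimistic member.
        return min(scores)
    if aggregate == "mean_minus_std":
        # Penalize responses on which the members disagree.
        return mean(scores) - pstdev(scores)
    raise ValueError(f"unknown aggregate: {aggregate}")


def rerank(models: Sequence[RewardModel], prompt: str,
           candidates: Sequence[str], aggregate: str = "mean") -> str:
    """Best-of-n reranking: return the candidate with the highest ensemble reward."""
    return max(candidates,
               key=lambda c: ensemble_reward(models, prompt, c, aggregate))


# Toy stand-ins for trained reward models (illustrative only).
def rm_a(prompt: str, response: str) -> float:
    # Flawed: rewards length without bound, so it can be "hacked" by verbosity.
    return len(response.split())


def rm_b(prompt: str, response: str) -> float:
    # Prefers responses of about three words.
    return -2 * abs(len(response.split()) - 3)


CANDIDATES = ["yes", "yes it is", "yes it is indeed very much so"]

if __name__ == "__main__":
    print(rerank([rm_a], "q", CANDIDATES))        # single model
    print(rerank([rm_a, rm_b], "q", CANDIDATES))  # two-member ensemble
```

In this toy setup the single model `rm_a` selects the longest candidate, while the two-member ensemble selects the three-word one under all three aggregators, because `rm_b` does not share the length bias. If every member shared that bias, aggregation would not help, which is the limitation the abstract reports for real ensembles.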

Related papers

Understanding Reward Hacking in Text-to-Image Reinforcement Learning [43.358394359914314]
We analyze reward hacking behaviors in text-to-image (T2I) RL post-training. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. We propose a lightweight and adaptive artifact reward model, trained on a small dataset of artifact-free and artifact-containing samples.
arXiv Detail & Related papers (2026-01-06T23:43:47Z)
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation [52.648073272395635]
We introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively.
arXiv Detail & Related papers (2025-11-25T12:35:57Z)
GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning [90.99527142037853]
We develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning.
arXiv Detail & Related papers (2025-09-02T16:41:07Z)
Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models [28.542061921495353]
There are two mainstream reward paradigms: model-based rewards and rule-based rewards. Both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. We propose Cooper, an RL framework that jointly optimizes both the policy model and the reward model. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct.
arXiv Detail & Related papers (2025-08-07T17:53:56Z)
Inference-Time Reward Hacking in Large Language Models [29.829648695171425]
Reward models function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance. We show that hedging mitigates reward hacking and achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference setups.
arXiv Detail & Related papers (2025-06-24T02:05:25Z)
Information-Theoretic Reward Decomposition for Generalizable RLHF [51.550547285296794]
We decompose the reward value into two independent components: prompt-free reward and prompt-related reward. We propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values.
arXiv Detail & Related papers (2025-04-08T13:26:07Z)
What Makes a Reward Model a Good Teacher? An Optimization Perspective [61.38643642719093]
We prove that regardless of how accurate a reward model is, if it induces low reward variance, the RLHF objective suffers from a flat landscape. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another.
arXiv Detail & Related papers (2025-03-19T17:54:41Z)
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs [64.29893431743608]
We show that state-of-the-art reward models suffer from substantial performance degradation even with minor input transformations. We propose to explicitly train them to assign similar scores to paraphrases, and find that this approach also improves robustness to other distinct kinds of transformations.
arXiv Detail & Related papers (2025-03-14T17:59:41Z)
Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference [27.205035058481553]
We propose assigning scores to every sentence, introducing an intermediate-grained reward model. A novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score. Our method outperforms the response-level reward model by 2.7% on RewardBench.
arXiv Detail & Related papers (2025-03-01T14:11:04Z)
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences. This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
Towards Reliable Alignment: Uncertainty-aware RLHF [14.20181662644689]
We show that the fluctuation of reward models can be detrimental to the alignment problem. We show that such policies are more risk-averse in the sense that they are more cautious of uncertain rewards. We use this ensemble of reward models to align language models using our methodology and observe that our empirical findings match our theoretical predictions.
arXiv Detail & Related papers (2024-10-31T08:26:51Z)
Evaluating Robustness of Reward Models for Mathematical Reasoning [14.97819343313859]
We introduce a new design for reliable evaluation of reward models, and to validate this, we construct RewardMATH. We demonstrate that the scores on RewardMATH strongly correlate with the results of optimized policy and effectively estimate reward overoptimization.
arXiv Detail & Related papers (2024-10-02T16:39:58Z)
Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment [50.21842377409232]
Despite the vital role reward models play in alignment, previous works have consistently overlooked their performance. This work first investigates the quality of the widely used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF. Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them both for optimization and evaluation.
arXiv Detail & Related papers (2024-09-26T04:28:35Z)
HAF-RM: A Hybrid Alignment Framework for Reward Model Training [51.59246299566669]
We propose a hybrid alignment framework, HAF-RM, for reward model training. It offers a principled and effective approach to enhancing the performance and alignment of reward models.
arXiv Detail & Related papers (2024-07-04T23:26:56Z)
RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models. The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
Transforming and Combining Rewards for Aligning Large Language Models [69.44634017612798]
A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We use a log-sigmoid function to transform rewards learned from Bradley-Terry preference models. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
arXiv Detail & Related papers (2024-02-01T16:39:28Z)
Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of "reward collapse", an empirical observation where the prevailing ranking-based approach results in an identical reward distribution. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
Scaling Laws for Reward Model Overoptimization [19.93331579503503]
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
arXiv Detail & Related papers (2022-10-19T17:56:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.