Inference-Time Reward Hacking in Large Language Models
- URL: http://arxiv.org/abs/2506.19248v1
- Date: Tue, 24 Jun 2025 02:05:25 GMT
- Title: Inference-Time Reward Hacking in Large Language Models
- Authors: Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon
- Abstract summary: Reward models function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance. We introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter and avoid reward hacking.
- Score: 18.461698175682987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to LLM outputs indicating, for example, which response would likely be preferred by a user or is most aligned with safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance -- a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft-Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, hedging offers a tactical choice to avoid placing undue confidence in high but potentially misleading proxy reward signals. We introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter and avoid reward hacking. We demonstrate through experiments that hedging mitigates reward hacking and achieves superior distortion-reward tradeoffs with minimal computational overhead.
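For concreteness, the sketch below shows how the inference-time mechanisms named in the abstract could be implemented around a black-box generator and proxy reward model. The generate and proxy_reward callables are placeholders introduced here for illustration, and the Best-of-Poisson parameterization (a Poisson-distributed candidate count, shifted so at least one candidate is drawn) is one plausible reading rather than the paper's exact definition; HedgeTune, which tunes n, the temperature, or the Poisson mean to stay out of the reward-hacking regime, is not reproduced here.

```python
import math
import random

def best_of_n(prompt, generate, proxy_reward, n):
    """Best-of-n (BoN): sample n candidates and return the one with the
    highest proxy reward."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=proxy_reward)

def soft_best_of_n(prompt, generate, proxy_reward, n, temperature):
    """Soft Best-of-n (SBoN): sample n candidates, then pick one with
    probability proportional to exp(reward / temperature).
    temperature -> 0 recovers BoN; a large temperature approaches uniform
    sampling over the n candidates, i.e. maximal hedging against a
    misspecified proxy reward."""
    candidates = [generate(prompt) for _ in range(n)]
    rewards = [proxy_reward(c) for c in candidates]
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / temperature) for r in rewards]
    return random.choices(candidates, weights=weights, k=1)[0]

def sample_poisson(mu):
    """Knuth's algorithm for sampling Poisson(mu); adequate for small mu."""
    threshold = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def best_of_poisson(prompt, generate, proxy_reward, mu):
    """Best-of-Poisson (BoP), as read here: draw the candidate count from a
    Poisson distribution with mean mu (shifted by one so at least one
    candidate is drawn), then apply BoN to those candidates."""
    n = 1 + sample_poisson(mu)
    return best_of_n(prompt, generate, proxy_reward, n)
```

In this sketch, the temperature in soft_best_of_n and the mean in best_of_poisson are the inference-time hedging parameters that an algorithm like HedgeTune would tune.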
Related papers
- Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models [28.542061921495353]
There are two mainstream reward paradigms: model-based rewards and rule-based rewards. Both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. We propose Cooper, an RL framework that jointly optimizes both the policy model and the reward model. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct.
arXiv Detail & Related papers (2025-08-07T17:53:56Z)
- Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment [54.787826863212146]
Inference-time computation offers a powerful axis for scaling the performance of language models. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute.
arXiv Detail & Related papers (2025-03-27T18:00:08Z)
- Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking [36.69993567249251]
We investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking.
arXiv Detail & Related papers (2024-12-12T18:34:47Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data. We analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment [7.349727826230864]
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise performance on the true objective. We propose a variant of BoN that aims to mitigate reward hacking at inference time by incorporating the Minimum Bayes Risk (MBR) objective as a proximity regularization term.
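A minimal sketch of that idea, assuming a generic pairwise similarity function in place of whatever utility the authors use for the MBR term (the exact objective and weighting are theirs and are not reproduced here):

```python
def mbr_regularized_bon(candidates, proxy_reward, similarity, beta):
    """Pick the candidate maximizing proxy reward plus an MBR-style proximity
    term: its average similarity to the other sampled candidates. beta controls
    how strongly selection is pulled toward consensus outputs rather than
    outputs that merely score high on a possibly hacked proxy reward."""
    def score(y):
        others = [c for c in candidates if c is not y]
        mbr = sum(similarity(y, o) for o in others) / max(len(others), 1)
        return proxy_reward(y) + beta * mbr
    return max(candidates, key=score)
```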
arXiv Detail & Related papers (2024-04-01T11:26:50Z)
- Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking [11.589217788048964]
We introduce a definition of reward hacking based on the correlation between proxy and true rewards for states. We show theoretically that regularization to the reference policy can effectively prevent reward hacking.
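One standard instantiation of reference-policy regularization, sketched here as a general illustration rather than the specific regularizer studied in that paper, penalizes the proxy reward by how far the policy drifts from a trusted reference:

```python
def regularized_reward(proxy_reward, logprob_policy, logprob_reference, beta):
    """Proxy reward minus a penalty on the log-probability gap between the
    current policy and a reference policy for the same response; beta trades
    off reward maximization against staying close to the reference."""
    return proxy_reward - beta * (logprob_policy - logprob_reference)
```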
arXiv Detail & Related papers (2024-03-05T18:22:15Z)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
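As a rough sketch, reward-model ensembling amounts to aggregating several scorers into one estimate; the aggregation rules below (mean, worst-case, mean minus spread) are common choices rather than necessarily the exact ones evaluated in that paper.

```python
import statistics

def ensemble_reward(response, reward_models, mode="mean"):
    """Aggregate several reward models into a single estimate.
    'mean' averages the scores; 'worst_case' takes the minimum; 'conservative'
    subtracts the score spread from the mean, penalizing responses the members
    disagree on. Per the cited paper, none of these eliminates reward hacking
    when the member models share error patterns."""
    scores = [rm(response) for rm in reward_models]
    if mode == "mean":
        return statistics.mean(scores)
    if mode == "worst_case":
        return min(scores)
    if mode == "conservative":
        return statistics.mean(scores) - statistics.pstdev(scores)
    raise ValueError(f"unknown aggregation mode: {mode}")
```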
arXiv Detail & Related papers (2023-12-14T18:59:04Z)
- Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of "reward collapse", an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
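A minimal sketch of the redistribution step only, assuming a linear per-step reward in some step features and a ridge penalty for stability; the pessimistic value iteration that PARTED performs on top of the learned proxy rewards is omitted.

```python
import numpy as np

def redistribute_rewards(step_features, trajectory_returns, ridge=1e-3):
    """Least-squares reward redistribution: fit weights w so that the summed
    per-step proxy rewards phi_t @ w along each trajectory match that
    trajectory's observed return.

    step_features: list of (T_i, d) arrays, one per trajectory
    trajectory_returns: length-N array of trajectory-level returns
    Returns per-step proxy rewards for each trajectory and the weights w.
    """
    A = np.stack([feats.sum(axis=0) for feats in step_features])  # (N, d): one equation per trajectory
    y = np.asarray(trajectory_returns, dtype=float)               # (N,)
    d = A.shape[1]
    w = np.linalg.solve(A.T @ A + ridge * np.eye(d), A.T @ y)     # ridge-regularized least squares
    return [feats @ w for feats in step_features], w
```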
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
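A sketch of the kind of risk-sensitive objective this family of methods optimizes: a blend of expected return and conditional value at risk (CVaR) over posterior samples of the reward function. The exact objective PG-BROIL uses (for instance, whether it is defined on returns or on baseline regret) may differ; this is illustrative only.

```python
import numpy as np

def risk_sensitive_objective(returns_per_hypothesis, lam=0.5, alpha=0.1):
    """Blend the expected return over reward-function hypotheses with the CVaR
    of the worst alpha-fraction of hypotheses. lam = 0 is risk-neutral;
    lam = 1 optimizes purely for the worst-case tail."""
    r = np.asarray(returns_per_hypothesis, dtype=float)
    expected = r.mean()
    var_threshold = np.quantile(r, alpha)       # value-at-risk cutoff
    cvar = r[r <= var_threshold].mean()         # mean of the worst alpha tail
    return (1 - lam) * expected + lam * cvar
```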
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
arXiv Detail & Related papers (2021-03-08T03:28:04Z)