Reward Hacking Mitigation using Verifiable Composite Rewards
- URL: http://arxiv.org/abs/2509.15557v1
- Date: Fri, 19 Sep 2025 03:40:27 GMT
- Title: Reward Hacking Mitigation using Verifiable Composite Rewards
- Authors: Mirza Farhan Bin Tarek, Rahmatollah Beheshti
- Abstract summary: Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. However, applications in the medical domain are susceptible to reward hacking during the reasoning phase. This work addresses two primary forms of this behavior: i) providing a final answer without preceding reasoning, and ii) employing non-standard reasoning formats to exploit the reward mechanism.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. However, applications in the medical domain, specifically for question answering, are susceptible to significant reward hacking during the reasoning phase. Our work addresses two primary forms of this behavior: i) providing a final answer without preceding reasoning, and ii) employing non-standard reasoning formats to exploit the reward mechanism. To mitigate these, we introduce a composite reward function with specific penalties for these behaviors. Our experiments show that extending RLVR with our proposed reward model leads to better-formatted reasoning with less reward hacking and good accuracy compared to the baselines. This approach marks a step toward reducing reward hacking and enhancing the reliability of models utilizing RLVR.
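The composite reward described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<think>` tag convention, the regex-based format check, and the penalty weights are all assumptions.

```python
import re

# Hypothetical penalty weights; the paper's actual values are not given here.
MISSING_REASONING_PENALTY = 1.0
BAD_FORMAT_PENALTY = 0.5

def composite_reward(response: str, answer_correct: bool) -> float:
    """Accuracy reward minus penalties for the two hacking behaviors:
    i) a final answer with no preceding reasoning, and
    ii) reasoning that does not follow the expected format."""
    reward = 1.0 if answer_correct else 0.0
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if think is None or not think.group(1).strip():
        # Behavior (i): answer given with no reasoning at all.
        reward -= MISSING_REASONING_PENALTY
    elif not response.lstrip().startswith("<think>"):
        # Behavior (ii): reasoning present but in a non-standard position/format.
        reward -= BAD_FORMAT_PENALTY
    return reward
```

A well-formatted correct response keeps the full accuracy reward, while either hacking behavior reduces it even when the final answer is right.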
Related papers
- IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking [67.20568716300272]
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
arXiv Detail & Related papers (2026-02-23T01:14:53Z)
- Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards [45.83885805939434]
A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler penalty towards a reference model. We propose a different framing: train the LM in a way that biases policy updates towards regions in which the reward is more accurate.
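The Kullback-Leibler penalty that this paper contrasts with is a standard RLHF shaping term; a minimal sketch, assuming a common per-token KL estimate (policy log-probs minus reference log-probs) and an illustrative `beta` coefficient:

```python
def kl_shaped_reward(task_reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Standard RLHF reward shaping: subtract an estimate of the KL
    divergence between the policy and a frozen reference model, scaled
    by beta, to keep the policy from drifting into reward-hacked regions."""
    # Per-token KL estimate summed over the sequence: log pi(a) - log pi_ref(a).
    kl = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return task_reward - beta * kl
```

A larger `beta` ties the policy more tightly to the reference model, trading off reward maximization against drift.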
arXiv Detail & Related papers (2026-02-20T07:32:22Z) - Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking [69.06218054848803]
We propose Adrial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game.<n>ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations.<n>ARA achieves the best alignment-utility tradeoff among all baselines.
arXiv Detail & Related papers (2026-02-02T07:34:57Z) - Factored Causal Representation Learning for Robust Reward Modeling in RLHF [40.483487519518896]
A reliable reward model is essential for aligning large language models with human preferences. Standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where high predicted reward does not translate into better behavior.
arXiv Detail & Related papers (2026-01-29T07:18:45Z)
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
- Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction [5.518813485456855]
External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks. These systems are prone to reward hacking, where the PRMs assign high scores to logically incorrect paths, leading to incorrect answers. We propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path.
arXiv Detail & Related papers (2025-08-06T08:48:55Z)
- Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs [32.99709073885827]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding. We introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct.
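One way such a metric could be computed is a sketch that reuses the standard unbiased Pass@K estimator, but counts a sample as solved only when both its reasoning path and its final answer are correct; the function name and interface are assumptions, not the paper's code.

```python
from math import comb

def cot_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K-style estimator where c is the number of samples
    (out of n) whose reasoning path AND final answer are both correct.
    Returns the probability that at least one of k drawn samples is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k must
        # include at least one fully correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this stricter criterion, answers reached via hacked or non-standard reasoning no longer count, so `cot_pass_at_k` lower-bounds plain `Pass@K` computed on answers alone.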
arXiv Detail & Related papers (2025-06-17T07:06:56Z)
- Spurious Rewards: Rethinking Training Signals in RLVR [130.3484456088909]
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 absolute points. We find code reasoning -- thinking in code without actual code execution -- to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR.
arXiv Detail & Related papers (2025-06-12T17:49:55Z)
- Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards. We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities. Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z)
- Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning [25.817231106021552]
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning.
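The contrast between the two credit-assignment forms can be sketched in a few lines; the function names are illustrative and the step rewards are hypothetical PRM scores, not the paper's setup.

```python
def summation_return(step_rewards):
    """Canonical summation-form credit assignment: the return grows with
    path length, so padding a path with many mediocre-but-positive steps
    can inflate the return (a reward-hacking opportunity)."""
    return sum(step_rewards)

def min_form_return(step_rewards):
    """Min-form credit assignment: the return is bounded by the weakest
    step, so appending redundant steps cannot raise it."""
    return min(step_rewards)
```

For example, ten steps scored 0.6 beat two steps scored 0.9 under summation, but not under the min form.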
arXiv Detail & Related papers (2025-04-21T17:59:02Z)
- RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking [11.589217788048964]
We introduce a definition of reward hacking based on the correlation between proxy and true rewards for states. We show theoretically that regularization to the reference policy can effectively prevent reward hacking.
arXiv Detail & Related papers (2024-03-05T18:22:15Z)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
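The ensemble aggregation this paper studies can be sketched as follows; the aggregation choices shown (worst-case `min` and `mean`) are common conventions, and the numbers in the test are hypothetical. Note why ensembling fails here: if every member inherits the same spurious correlation, the shared bias survives both aggregates.

```python
def ensemble_reward(member_rewards, how="min"):
    """Aggregate per-member reward estimates into one robust reward.
    'min' takes the most conservative (worst-case) member score;
    'mean' averages the members. Neither removes a bias that ALL
    members share, which is why ensembles mitigate but do not
    eliminate reward hacking."""
    if how == "min":
        return min(member_rewards)
    return sum(member_rewards) / len(member_rewards)
```

For instance, if three members all overrate a hacked output by the same spurious bonus, the min and mean of the inflated scores are inflated by exactly that bonus as well.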
arXiv Detail & Related papers (2023-12-14T18:59:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.