Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
- URL: http://arxiv.org/abs/2602.01750v1
- Date: Mon, 02 Feb 2026 07:34:57 GMT
- Title: Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
- Authors: Mohammad Beigi, Ming Jin, Junshan Zhang, Qifan Wang, Lifu Huang,
- Abstract summary: We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations. ARA achieves the best alignment-utility tradeoff among all baselines.
- Score: 69.06218054848803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines: reducing sycophancy to near-SFT levels while improving helpfulness, decreasing verbosity while achieving the highest ROUGE-L, and suppressing code gaming while improving Pass@1. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize across domains -- a Hacker trained on code gaming exhibits increased sycophancy despite no reward for this behavior, and an Auditor trained on one domain effectively suppresses exploitation in others, enabling efficient multi-domain defense with a single model.
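The gating step in the second stage can be illustrated with a short sketch. This is a minimal illustration, not the authors' implementation: it assumes the Auditor is a linear probe over a pooled latent representation of the response, and the `threshold` and `penalty` values are hypothetical hyperparameters chosen for demonstration.

```python
# Minimal sketch of auditor-gated reward shaping in the spirit of AG-RLHF.
# Assumptions (not from the paper): the Auditor is a linear probe over the
# policy's pooled hidden state, and `threshold`/`penalty` are hypothetical
# hyperparameters used only for illustration.
import torch
import torch.nn as nn


class Auditor(nn.Module):
    """Binary probe that scores how likely a response exploits the reward model."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, 1)

    def hack_probability(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (batch, hidden_dim) pooled representation of the response
        return torch.sigmoid(self.probe(latent)).squeeze(-1)


def gated_reward(raw_reward: torch.Tensor,
                 latent: torch.Tensor,
                 auditor: Auditor,
                 threshold: float = 0.5,
                 penalty: float = 1.0) -> torch.Tensor:
    """Pass the reward model's score through only when the Auditor sees no hacking."""
    p_hack = auditor.hack_probability(latent)
    flagged = p_hack > threshold
    # Flagged samples are penalised instead of receiving the raw reward,
    # turning suspected hacking into an explicit negative training signal.
    return torch.where(flagged, raw_reward - penalty * p_hack, raw_reward)


if __name__ == "__main__":
    auditor = Auditor(hidden_dim=16)
    rewards = torch.tensor([1.2, 0.8])
    latents = torch.randn(2, 16)
    print(gated_reward(rewards, latents, auditor))
```

In this reading of the abstract, the policy would be optimised against `gated_reward` rather than the raw reward-model score, which is what makes suspected hacking a measurable, penalisable event rather than a silent failure.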
Related papers
- IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking [67.20568716300272]
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
arXiv Detail & Related papers (2026-02-23T01:14:53Z)
- Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets; a minimal reranking sketch appears after this list.
arXiv Detail & Related papers (2025-10-02T09:55:26Z)
- Reward Hacking Mitigation using Verifiable Composite Rewards [5.061948558533868]
Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. This work addresses two primary forms of reward hacking: (i) providing a final answer without preceding reasoning, and (ii) employing non-standard reasoning formats to exploit the reward mechanism.
arXiv Detail & Related papers (2025-09-19T03:40:27Z)
- Detecting and Mitigating Reward Hacking in Reinforcement Learning Systems: A Comprehensive Empirical Study [2.1797343876622097]
Reward hacking in Reinforcement Learning systems poses a critical threat to the deployment of autonomous agents. This paper presents a large-scale empirical study of reward hacking across diverse RL environments and algorithms.
arXiv Detail & Related papers (2025-07-08T03:00:02Z)
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
- Reward Shaping to Mitigate Reward Hacking in RLHF [47.71454266800376]
Preference As Reward (PAR) is a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than that of competing approaches.
arXiv Detail & Related papers (2025-02-26T02:57:59Z)
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)
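Below is the reranking sketch referenced in the dense reasoning reward entry above: a best-of-N selection in which each sampled trace is scored by summing its token-level rewards and the highest-scoring trace is kept. This is a minimal illustration under assumed interfaces; the `token_rewards` callable and the function names are hypothetical stand-ins, not that paper's implementation.

```python
# Illustrative best-of-N reranking with a dense, token-level reward model.
# The `token_rewards` callable is a hypothetical stand-in for the learned
# reward model; any scorer that returns one reward per token would fit.
from typing import Callable, List, Sequence


def rerank_traces(traces: Sequence[List[str]],
                  token_rewards: Callable[[List[str]], List[float]]) -> List[str]:
    """Return the sampled trace with the highest summed token-level reward."""
    def trace_score(trace: List[str]) -> float:
        # Dense supervision: every token gets its own reward, so the trace
        # score is simply the sum over the sequence.
        return sum(token_rewards(trace))

    return max(traces, key=trace_score)


if __name__ == "__main__":
    # Toy reward: prefer traces that include an explicit "check" step.
    def toy_token_rewards(trace: List[str]) -> List[float]:
        return [1.0 if tok == "check" else 0.0 for tok in trace]

    sampled = [["guess", "answer"], ["compute", "check", "answer"]]
    print(rerank_traces(sampled, toy_token_rewards))  # -> ['compute', 'check', 'answer']
```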