Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
- URL: http://arxiv.org/abs/2510.01367v3
- Date: Tue, 07 Oct 2025 17:08:11 GMT
- Title: Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
- Authors: Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
- Abstract summary: Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task.
- Score: 44.34183850072512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less 'effort' than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward. We progressively truncate a model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
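To make the truncate-and-score idea above concrete, here is a minimal sketch, not the authors' implementation: `force_answer` and `reward` are hypothetical stand-ins for forcing an answer from a truncated CoT and scoring it against the task reward, and the cutoff grid and sample counts are arbitrary choices.

```python
from statistics import mean

def trace_score(cot_tokens, force_answer, reward,
                num_cutoffs=10, samples_per_cutoff=4):
    """Sketch of a TRACE-style score: truncate the chain-of-thought at several
    lengths, force an answer from each prefix, and compute the area under the
    expected-reward vs. CoT-fraction curve. A large area means high reward is
    reached from only a small fraction of the CoT, the signature of a shortcut
    (possible implicit reward hack)."""
    fractions = [i / num_cutoffs for i in range(num_cutoffs + 1)]
    expected_reward = []
    for frac in fractions:
        prefix = cot_tokens[: int(frac * len(cot_tokens))]
        # Estimate the expected reward at this cutoff by sampling forced answers.
        expected_reward.append(mean(
            reward(force_answer(prefix)) for _ in range(samples_per_cutoff)
        ))
    # Trapezoidal area under the curve over the normalized CoT length in [0, 1].
    return sum(
        0.5 * (expected_reward[i] + expected_reward[i + 1])
        * (fractions[i + 1] - fractions[i])
        for i in range(num_cutoffs)
    )
```

Under this reading, a hacking model's expected reward rises almost immediately and the area is large, whereas a model that genuinely needs its reasoning only reaches high expected reward near the end of the curve.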
Related papers
- IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking [67.20568716300272]
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
arXiv Detail & Related papers (2026-02-23T01:14:53Z)
- Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models [59.6715047267181]
Small reasoning models (SRMs) are prone to hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained chain-of-thought evaluation. We propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model.
arXiv Detail & Related papers (2026-02-05T17:15:12Z)
- Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking [69.06218054848803]
We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations. ARA achieves the best alignment-utility tradeoff among all baselines.
arXiv Detail & Related papers (2026-02-02T07:34:57Z)
- Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks [1.4291137439893342]
Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs. CoT is also a powerful tool for monitoring the behaviours of these agents. We show that optimisation pressures on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property.
arXiv Detail & Related papers (2026-01-30T15:34:14Z)
- Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs [95.06033929366203]
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z)
- Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning [8.677768413982802]
Language models can engage in reward hacking without revealing this behavior in their chain-of-thought reasoning. We propose verbalization fine-tuning (VFT) to train models to explicitly acknowledge when they are influenced by prompt cues. Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection.
arXiv Detail & Related papers (2025-06-28T06:37:10Z)
- Inference-Time Reward Hacking in Large Language Models [18.461698175682987]
Reward models function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance. We introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter and avoid reward hacking.
arXiv Detail & Related papers (2025-06-24T02:05:25Z)
- Reasoning Models Don't Always Say What They Think [48.05987314492555]
Chain-of-thought (CoT) allows monitoring a model's intentions and reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in prompts.
arXiv Detail & Related papers (2025-05-08T16:51:43Z)
- Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning [25.817231106021552]
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning. (A rough sketch of the sum-versus-min contrast appears after this list.)
arXiv Detail & Related papers (2025-04-21T17:59:02Z)
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
- The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret [64.04721528586747]
We show that a sufficiently low expected test error of the reward model guarantees low worst-case regret. We then show that similar problems persist even when using policy regularization techniques.
arXiv Detail & Related papers (2024-06-22T06:43:51Z)
- Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
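The "Stop Summation" entry above names its technique only in its title, so here is a rough, assumed reading of the contrast between summation-form and min-form credit assignment. The notation is mine, not the paper's: $r_k$ is the process reward for step $k$ of a $T$-step reasoning trace and $\gamma$ is a discount factor.

```latex
% Summation-form credit assignment (the canonical form the entry identifies as
% the cause of PRM-induced reward hacking): each step is credited with the
% discounted sum of subsequent process rewards, so padding a trace with many
% mildly rewarded but unnecessary steps can inflate the return.
\[
Q^{\mathrm{sum}}_t \;=\; \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k
\]
% Min-form credit assignment (the alternative suggested by the title): a step
% is credited only with the weakest subsequent process reward, so a single bad
% step caps the return and cannot be offset by extra high-scoring steps.
\[
Q^{\mathrm{min}}_t \;=\; \min_{t \le k \le T} r_k
\]
```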