Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2506.22777v2
- Date: Sun, 13 Jul 2025 15:36:35 GMT
- Title: Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
- Authors: Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael,
- Abstract summary: Language models can engage in reward hacking without revealing this behavior in their chain-of-thought reasoning.<n>We propose verbalization fine-tuning (VFT) to train models to explicitly acknowledge when they are influenced by prompt cues.<n>Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection.
- Score: 8.677768413982802
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language models trained with reinforcement learning (RL) can engage in reward hacking--the exploitation of unintended strategies for high reward--without revealing this behavior in their chain-of-thought reasoning. This makes the detection of reward hacking difficult, posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL fine-tuning intervention that trains models to explicitly acknowledge when they are influenced by prompt cues--hints which point to incorrect answers (e.g., "a Stanford professor thinks the answer is A"). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to exploit these cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing it. After RL, only 6% of the VFT-trained model's responses consist of undetected reward hacks. In comparison, when we perform RL without VFT, the rate of undetected reward hacks goes up to 88%; with a debiasing baseline intervention, this increases further to 99%. VFT achieves this by substantially increasing how often models verbalize the influence of cues, from 8% to 43% after VFT, and up to 94% after RL. Baselines remain low even after RL (11% and 1%). Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection, offering a practical path toward more transparent and safe AI systems.
Related papers
- Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective [82.24301452333577]
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning.<n>A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains.<n>We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains.
arXiv Detail & Related papers (2025-06-17T20:24:00Z) - Spurious Rewards: Rethinking Training Signals in RLVR [130.3484456088909]
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models.<n>For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4%.<n>We find code reasoning -- thinking in code without actual code execution -- to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR.
arXiv Detail & Related papers (2025-06-12T17:49:55Z) - The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason [36.50007948478452]
Our research investigates the impact of reward noise on post-training large language models.<n>We found that LLMs demonstrate strong robustness to substantial reward noise.<n>Our findings suggest the importance of improving models' foundational abilities during the pre-training phase.
arXiv Detail & Related papers (2025-05-28T17:59:03Z) - AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models.<n>We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z) - TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning [11.573904453859098]
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs)<n>Yet, RL's success relies on the reliability of rewards, which are provided by verifiers.<n>In this paper, we expose and analyze a widespread problem--false negatives--where verifiers wrongly reject correct model outputs.<n>We propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods.
arXiv Detail & Related papers (2025-05-20T17:16:44Z) - Reward Shaping to Mitigate Reward Hacking in RLHF [47.71454266800376]
Preference As Reward (PAR) is a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning.<n>On the AlpacaEval 2.0 benchmark, PAR achieves a win rate of at least 5 percentage points higher than competing approaches.
arXiv Detail & Related papers (2025-02-26T02:57:59Z) - Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play [52.3079697845254]
eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training.<n>We show eva can create effective RL curricula and is robust across ablations.
arXiv Detail & Related papers (2024-10-31T08:15:32Z) - On Designing Effective RL Reward at Training Time for LLM Reasoning [14.006845442313134]
We evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM)<n>Surprisingly, even though these learned reward models have strong inference-time performances, they may NOT help or even hurt RL training.<n>We introduce two novel reward refinement techniques, including Clipping and Delta.
arXiv Detail & Related papers (2024-10-19T13:53:50Z) - ODIN: Disentangled Reward Mitigates Hacking in RLHF [127.35607931337019]
We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback.
A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores.
Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
arXiv Detail & Related papers (2024-02-11T22:40:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.