Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction
- URL: http://arxiv.org/abs/2508.04216v1
- Date: Wed, 06 Aug 2025 08:48:55 GMT
- Title: Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction
- Authors: Ruike Song, Zeen Song, Huijie Guo, Wenwen Qiang,
- Abstract summary: External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks.<n>These systems are prone to reward hacking, where high-scoring but logically incorrect paths are assigned high scores by the PRMs, leading to incorrect answers.<n>We propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path.
- Score: 5.518813485456855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks such as mathematical problem solving. However, these systems are prone to reward hacking, where high-scoring but logically incorrect paths are assigned high scores by the PRMs, leading to incorrect answers. From a causal inference perspective, we attribute this phenomenon primarily to the presence of confounding semantic features. To address it, we propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path. CRA trains sparse autoencoders on the PRM's internal activations to recover interpretable features, then corrects confounding by using backdoor adjustment. Experiments on math solving datasets demonstrate that CRA mitigates reward hacking and improves final accuracy, without modifying the policy model or retraining PRM.
Related papers
- IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking [67.20568716300272]
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking.<n>We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models.<n>We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
arXiv Detail & Related papers (2026-02-23T01:14:53Z) - Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning [59.76691952347156]
Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs)<n>Most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions.<n>We propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL.
arXiv Detail & Related papers (2026-01-26T21:38:20Z) - CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning [4.765206163164323]
CLEANER exploits intrinsic self-correction capabilities to eliminate error-contaminated context during data collection.<n>Similarity-Aware Adaptive Rollback mechanism autonomously constructs clean, purified trajectories.<n>Results show average accuracy gains of 6%, 3%, and 5% over baselines.
arXiv Detail & Related papers (2026-01-21T16:14:30Z) - SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal.<n>Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise.<n>We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z) - Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards [40.905635870672945]
Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer.<n>In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability.<n>This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process.
arXiv Detail & Related papers (2025-10-09T04:30:45Z) - Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner [31.033131727230277]
Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL)<n>We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training.<n>Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines.
arXiv Detail & Related papers (2025-07-31T07:54:58Z) - Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback [52.1410307583181]
We useReinforcement Learning from Human Feedback to train language models (LMs) to follow complex human preferences.<n>As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM)<n>We propose Off-Policy Corrected Reward Modeling to correct the RM using importance weighting, without requiring new labels or samples.
arXiv Detail & Related papers (2025-07-21T11:19:04Z) - RRO: LLM Agent Optimization Through Rising Reward Trajectories [52.579992804584464]
Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks.<n>In practice, agents sensitive to the outcome of certain key steps which makes them likely to fail the task.<n>We propose Reward Rising Optimization (RRO) to mitigate this issue.
arXiv Detail & Related papers (2025-05-27T05:27:54Z) - Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models [50.4652276723694]
Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities.<n>Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%.
arXiv Detail & Related papers (2025-05-22T05:56:11Z) - Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards.<n>We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities.<n> Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z) - Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning [25.817231106021552]
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks.<n>However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning.<n>In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning.
arXiv Detail & Related papers (2025-04-21T17:59:02Z) - ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [25.329712997545794]
We propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR)<n>ReARTeR enhances RAG systems' reasoning capabilities through post-training and test-time scaling.<n> Experimental results on multi-step reasoning benchmarks demonstrate significant improvements.
arXiv Detail & Related papers (2025-01-14T05:56:26Z) - ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [50.45155830888697]
We develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models.
We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget.
arXiv Detail & Related papers (2024-06-06T07:40:00Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.<n>Recent methods aim to mitigate misalignment by learning reward functions from human preferences.<n>We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.