IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
- URL: http://arxiv.org/abs/2602.19416v1
- Date: Mon, 23 Feb 2026 01:14:53 GMT
- Title: IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking
- Authors: Mohammad Beigi, Ming Jin, Junshan Zhang, Jiaxin Zhang, Qifan Wang, Lifu Huang
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
- Score: 67.20568716300272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking, where models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque, making hacking behaviors difficult to detect or correct. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We propose Contrastive Inverse Reinforcement Learning (C-IRL), which reconstructs the implicit reward function by contrasting paired responses from post-alignment and baseline policies to explain behavioral shifts during RLHF. We then decompose the reconstructed reward via sparse autoencoders into interpretable features, enabling identification of hacking signatures through contribution analysis. Finally, we propose mitigation strategies that target problematic features while preserving beneficial alignment: clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation. Experiments across multiple reward model configurations show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
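To make the contrastive reconstruction step concrete, below is a minimal sketch of the C-IRL idea as described in the abstract: fit a probe so that responses from the post-alignment policy outscore paired baseline responses on the same prompts. The probe architecture, the logistic pairwise loss, and all names (`RewardProbe`, `contrastive_irl_loss`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of contrastive implicit-reward reconstruction (one reading of C-IRL).
# Assumption: the implicit reward is approximated by a probe trained so that
# responses from the RLHF-tuned policy outscore paired responses from the
# pre-RLHF baseline on the same prompt (a Bradley-Terry-style objective).
import torch
import torch.nn as nn

class RewardProbe(nn.Module):
    """Scores a (prompt, response) embedding; stands in for the reconstructed reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb).squeeze(-1)

def contrastive_irl_loss(probe, emb_aligned, emb_baseline):
    """Encourage r(aligned) > r(baseline) for paired responses to the same prompts."""
    margin = probe(emb_aligned) - probe(emb_baseline)
    return -torch.nn.functional.logsigmoid(margin).mean()

# Toy usage with random embeddings standing in for (prompt, response) features.
dim = 768
probe = RewardProbe(dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-4)
for _ in range(100):
    emb_aligned = torch.randn(32, dim)   # responses from the post-RLHF policy
    emb_baseline = torch.randn(32, dim)  # paired responses from the baseline policy
    loss = contrastive_irl_loss(probe, emb_aligned, emb_baseline)
    opt.zero_grad(); loss.backward(); opt.step()
# In the paper's pipeline, the reconstructed reward would then be decomposed
# (e.g., with sparse autoencoders over probe activations) and screened for
# hacking-related features; this sketch stops at reconstruction.
```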
Related papers
- Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking [69.06218054848803]
We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations. ARA achieves the best alignment-utility tradeoff among all baselines.
arXiv Detail & Related papers (2026-02-02T07:34:57Z)
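Read as a two-player loop, one plausible shape for the Auditor stage is a classifier over latent representations that separates Hacker rollouts from reference rollouts; everything below (class names, features, loss) is an assumption for illustration, not ARA's actual design.

```python
# Illustrative Auditor stage for adversarial reward auditing (assumed reading).
import torch, torch.nn as nn

class Auditor(nn.Module):
    """Binary classifier over latent representations: exploit vs. benign."""
    def __init__(self, dim: int):
        super().__init__()
        self.clf = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))
    def forward(self, z):
        return self.clf(z).squeeze(-1)

def auditor_step(auditor, opt, z_hacker, z_reference):
    """One update: Hacker rollouts get label 1, reference rollouts get label 0."""
    z = torch.cat([z_hacker, z_reference])
    y = torch.cat([torch.ones(len(z_hacker)), torch.zeros(len(z_reference))])
    loss = nn.functional.binary_cross_entropy_with_logits(auditor(z), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random latents; in a full loop the Hacker policy would be
# updated to maximize the proxy reward, and flagged samples would be used to
# patch or reweight the reward model.
aud = Auditor(64)
opt = torch.optim.Adam(aud.parameters(), lr=1e-3)
auditor_step(aud, opt, torch.randn(16, 64), torch.randn(16, 64))
```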
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
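A hedged sketch of a reference-based reward chain: score content as ordered coverage of key concepts extracted from a reference, score style separately, and combine. The concept matcher, the style heuristic, and the weights are placeholders, not RLVRR's components.

```python
# Hedged sketch of a reference-based "reward chain" reward: content is scored
# as ordered coverage of reference concepts, style as a separate scalar.
def content_score(response: str, concept_chain: list[str]) -> float:
    """Fraction of reference concepts matched in the order they appear in the reference."""
    pos, matched = 0, 0
    text = response.lower()
    for concept in concept_chain:
        idx = text.find(concept.lower(), pos)
        if idx != -1:
            matched += 1
            pos = idx + len(concept)
    return matched / max(len(concept_chain), 1)

def style_score(response: str) -> float:
    """Placeholder stylistic check (here: rewards multi-sentence answers)."""
    return 1.0 if response.count(".") >= 2 else 0.5

def reference_based_reward(response, concept_chain, w_content=0.7, w_style=0.3):
    return w_content * content_score(response, concept_chain) + w_style * style_score(response)

print(reference_based_reward(
    "Photosynthesis captures light energy. It then fixes carbon into sugars.",
    ["light energy", "carbon", "sugars"]))
```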
- Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking [13.417125511014447]
We propose PBRR, an automated framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or that modify the proxy reward function with other approaches.
arXiv Detail & Related papers (2025-10-14T23:18:24Z)
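One way to picture an additive, transition-dependent correction learned from preferences: r_total(s, a, s') = r_proxy(s, a, s') + c_theta(s, a, s'), with c_theta fit so preferred trajectories obtain higher corrected returns. The network, features, and loss below are assumed for illustration, not the paper's exact formulation.

```python
# Sketch of an additive, transition-dependent correction term fit from
# trajectory preferences with a Bradley-Terry-style loss (assumed form).
import torch, torch.nn as nn

class Correction(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, transition_feats):          # (T, dim) features of (s, a, s')
        return self.net(transition_feats).squeeze(-1)

def preference_loss(corr, traj_pref, traj_other, proxy_pref, proxy_other):
    """The preferred trajectory should have a higher corrected return than the other."""
    ret_pref = (proxy_pref + corr(traj_pref)).sum()
    ret_other = (proxy_other + corr(traj_other)).sum()
    return -torch.nn.functional.logsigmoid(ret_pref - ret_other)

# Toy usage with random transition features and proxy rewards.
corr = Correction(16)
opt = torch.optim.Adam(corr.parameters(), lr=1e-3)
loss = preference_loss(corr, torch.randn(20, 16), torch.randn(20, 16),
                       torch.randn(20), torch.randn(20))
opt.zero_grad(); loss.backward(); opt.step()
```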
- IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards [22.802937805177773]
Instruction Following Decorator (IFDecorator) is a framework that wraps RLVR training into a robust and sample-efficient pipeline. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Our trip wires show significant reductions in reward hacking rates.
arXiv Detail & Related papers (2025-08-06T17:00:54Z)
- Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction [5.518813485456855]
External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks. These systems are prone to reward hacking, where logically incorrect paths are assigned high scores by the PRMs, leading to incorrect answers. We propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path.
arXiv Detail & Related papers (2025-08-06T08:48:55Z)
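If the correction is read as a standard backdoor adjustment over a confounder C, the adjusted path score is E[R | do(path)] = sum_c P(C=c) E[R | path, C=c]. The toy confounder and numbers below are illustrative only; CRA's actual estimator is not reproduced here.

```python
# Hedged sketch of a backdoor adjustment over a discrete confounder C
# (e.g., a spurious surface feature of reasoning paths).
def backdoor_adjusted_reward(prm_score, p_confounder):
    """
    E[R | do(path)] = sum_c P(C=c) * E[R | path, C=c]
    prm_score:    dict c -> PRM score of the path conditioned on confounder value c
    p_confounder: dict c -> marginal P(C=c) estimated over the corpus
    """
    return sum(p_confounder[c] * prm_score[c] for c in p_confounder)

# Toy example: a path that only looks good under one confounder setting is
# pulled toward its confounder-averaged score.
print(backdoor_adjusted_reward({"verbose": 0.9, "terse": 0.4},
                               {"verbose": 0.3, "terse": 0.7}))  # 0.55
```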
- Reward Shaping to Mitigate Reward Hacking in RLHF [47.71454266800376]
Preference As Reward (PAR) is a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches.
arXiv Detail & Related papers (2025-02-26T02:57:59Z)
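One plausible reading of using the reward model's latent preference as the training signal is to convert raw scores into a bounded preference probability against a reference response, so that ever-larger raw scores yield diminishing extra reward. The sigmoid-of-difference form and the reference choice below are assumptions for illustration.

```python
# Hedged sketch: turn a raw reward-model score into a bounded "preference"
# signal against a reference response, so runaway raw scores saturate.
import math

def preference_as_reward(r_response: float, r_reference: float) -> float:
    """Bradley-Terry preference probability that the response beats the reference."""
    return 1.0 / (1.0 + math.exp(-(r_response - r_reference)))

# A response that overshoots the reference by a huge margin gains little extra
# reward, which blunts the incentive to over-optimize the proxy score.
print(preference_as_reward(12.0, 2.0))   # ~0.99995
print(preference_as_reward(4.0, 2.0))    # ~0.88
```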
- InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose InfoRM, a reward modeling framework that introduces a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
arXiv Detail & Related papers (2024-02-14T17:49:07Z)
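A variational information bottleneck reward model can be sketched as a stochastic latent with a KL penalty toward a standard normal prior, plus a reward head on the latent. The sizes, the beta weight, and the preference loss below are illustrative, not InfoRM's configuration.

```python
# Hedged sketch of an information-bottleneck reward model.
import torch, torch.nn as nn

class IBRewardModel(nn.Module):
    def __init__(self, dim, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * z_dim)   # predicts latent mean and log-variance
        self.head = nn.Linear(z_dim, 1)
    def forward(self, feats):
        mu, logvar = self.enc(feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)  # KL to N(0, I)
        return self.head(z).squeeze(-1), kl

def ib_preference_loss(model, feats_chosen, feats_rejected, beta=0.1):
    """Bradley-Terry preference loss plus the bottleneck penalty."""
    r_c, kl_c = model(feats_chosen)
    r_r, kl_r = model(feats_rejected)
    pref = -torch.nn.functional.logsigmoid(r_c - r_r).mean()
    return pref + beta * (kl_c + kl_r).mean()

# Toy usage with random response features.
model = IBRewardModel(768)
loss = ib_preference_loss(model, torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```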
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)