Going Beyond Heuristics by Imposing Policy Improvement as a Constraint
- URL: http://arxiv.org/abs/2507.05328v1
- Date: Mon, 07 Jul 2025 17:52:53 GMT
- Title: Going Beyond Heuristics by Imposing Policy Improvement as a Constraint
- Authors: Chi-Chang Lee, Zhang-Wei Hong, Pulkit Agrawal,
- Abstract summary: Heuristic Enhanced Policy Optimization is a plug-and-play method for leveraging reward hacking in reinforcement learning.<n>We show that HEPO effectively leverages reward hacking while avoiding the pitfall of prior methods for mitigating reward hacking.<n>More surprisingly, HEPO allows policy optimization to achieve good performance even when rewards are not well-engineered and designed by non-expert humans.
- Score: 27.085318050269965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many reinforcement learning (RL) applications, augmenting the task rewards with heuristic rewards that encode human priors about how a task should be solved is crucial for achieving desirable performance. However, because such heuristics are usually not optimal, much human effort and computational resources are wasted in carefully balancing tasks and heuristic rewards. Theoretically rigorous ways of incorporating heuristics rely on the idea of \textit{policy invariance}, which guarantees that the performance of a policy obtained by maximizing heuristic rewards is the same as the optimal policy with respect to the task reward. However, in practice, policy invariance doesn't result in policy improvement, and such methods are known to empirically perform poorly. We propose a new paradigm to mitigate reward hacking and effectively use heuristics based on the practical goal of maximizing policy improvement instead of policy improvement. Our framework, Heuristic Enhanced Policy Optimization (HEPO), effectively leverages heuristics while avoiding the pitfall of prior methods for mitigating reward hacking. HEPO achieves superior performance on standard benchmarks with well-engineered reward functions. More surprisingly, HEPO allows policy optimization to achieve good performance even when heuristics are not well-engineered and designed by non-expert humans, showcasing HEPO's ability to reduce human effort in reward design. % HEPO is a plug-and-play optimization method for leveraging heuristics in reinforcement learning. Code is available at https://github.com/Improbable-AI/hepo.
Related papers
- On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm.<n>OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration.<n>Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning [36.00719049772089]
We propose the Trust Region Preference Approximation (TRPA) algorithm.<n>As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue.<n> Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability.
arXiv Detail & Related papers (2025-04-06T15:48:26Z) - $α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs [45.46582930202524]
$alpha$-DPO is an adaptive preference optimization algorithm for large language models.
It balances the policy model and the reference model to achieve personalized reward margins.
It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z) - Self-Improving Robust Preference Optimization [22.493029742076605]
Online and offline RLHF methods have been highly successful in aligning AI with human preferences.<n>We propose Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework.<n>We show that SRPO can be efficiently optimized using standard supervised learning techniques at scale.
arXiv Detail & Related papers (2024-06-03T17:53:25Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.<n>Recent methods aim to mitigate misalignment by learning reward functions from human preferences.<n>We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z) - Admissible Policy Teaching through Reward Design [32.39785256112934]
We study reward design strategies for incentivizing a reinforcement learning agent to adopt a policy from a set of admissible policies.
The goal of the reward designer is to modify the underlying reward function cost-efficiently while ensuring that any approximately optimal deterministic policy under the new reward function is admissible.
arXiv Detail & Related papers (2022-01-06T18:49:57Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO)
OPPO achieves $tildeO(sqrtd2 H3 T )$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.