MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
- URL: http://arxiv.org/abs/2501.13011v1
- Date: Wed, 22 Jan 2025 16:53:08 GMT
- Title: MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
- Authors: Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
- Abstract summary: We propose a training method which avoids agents learning undesired multi-step plans that receive high reward. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward.
- Score: 17.055020939723676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks"), even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings that model different misalignment failure modes, including 2-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering.
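To make the abstract's core idea concrete, the sketch below contrasts MONA-style training with ordinary RL in a minimal tabular setting: the value update is myopic (no bootstrapping from the next state's value), while the per-step reward is augmented with an overseer's approval of the chosen action. This is only an illustrative reading of the abstract, not the paper's implementation; the environment interface, `approval_fn`, and hyperparameters are hypothetical.

```python
import random
from collections import defaultdict

# Illustrative sketch (not the paper's code): myopic optimization with a
# far-sighted approval signal folded into the per-step reward.

ALPHA = 0.1    # learning rate
EPSILON = 0.1  # exploration rate


def mona_update(q, state, action, env_reward, approval):
    """Myopic target: immediate reward plus overseer approval only.
    Ordinary Q-learning would add gamma * max_a' Q(next_state, a') here,
    which is exactly the term that lets multi-step reward hacks pay off."""
    target = env_reward + approval
    q[(state, action)] += ALPHA * (target - q[(state, action)])


def act(q, state, actions):
    """Epsilon-greedy choice over the myopic value estimates."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def train_episode(env, approval_fn, q, actions):
    """approval_fn(state, action) stands in for the non-myopic overseer:
    it scores whether the action looks good for the long run, without the
    agent ever optimizing future reward directly."""
    state = env.reset()
    done = False
    while not done:
        action = act(q, state, actions)
        next_state, env_reward, done = env.step(action)
        mona_update(q, state, action, env_reward, approval_fn(state, action))
        state = next_state


# Usage sketch: q = defaultdict(float); train_episode(my_env, my_approval_fn, q, my_actions)
```

The design choice the abstract emphasizes is that the optimization horizon stays short while the approval signal carries the long-horizon judgement, so a multi-step plan cannot be reinforced merely because it pays off later.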
Related papers
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.65034908728828]
Training large language models (LLMs) as interactive agents presents unique challenges.
While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored.
We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.
We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - World Models Increase Autonomy in Reinforcement Learning [6.151562278670799]
Reinforcement learning (RL) is an appealing paradigm for training intelligent agents.
The MoReFree agent adapts two key mechanisms, exploration and policy learning, to handle reset-free tasks.
It exhibits superior data-efficiency across various reset-free tasks without access to environmental reward or demonstrations.
arXiv Detail & Related papers (2024-08-19T08:56:00Z) - Beyond Optimism: Exploration With Partially Observable Rewards [10.571972176725371]
Exploration in reinforcement learning (RL) remains an open challenge.
We present a novel strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy.
arXiv Detail & Related papers (2024-06-20T00:42:02Z) - Meta-Learning Adversarial Bandit Algorithms [55.72892209124227]
We study online meta-learning with bandit feedback.
We learn to tune online mirror descent (OMD) with self-concordant barrier regularizers.
arXiv Detail & Related papers (2023-07-05T13:52:10Z) - The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z) - URLB: Unsupervised Reinforcement Learning Benchmark [82.36060735454647]
We introduce the Unsupervised Reinforcement Learning Benchmark (URLB).
URLB consists of two phases: reward-free pre-training and downstream task adaptation with extrinsic rewards.
We provide twelve continuous control tasks from three domains for evaluation and open-source code for eight leading unsupervised RL methods.
arXiv Detail & Related papers (2021-10-28T15:07:01Z) - Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for mean-field control (MFC).
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)