Mitigating Goal Misgeneralization via Minimax Regret
- URL: http://arxiv.org/abs/2507.03068v2
- Date: Fri, 18 Jul 2025 07:38:55 GMT
- Title: Mitigating Goal Misgeneralization via Minimax Regret
- Authors: Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, Michael Dennis
- Abstract summary: This paper theoretically studies the possibility of goal misgeneralization under different training objectives. We then empirically show that the standard MEV-based training method exhibits goal misgeneralization in procedurally-generated grid-world environments. Our findings suggest that minimax expected regret is a promising approach to mitigating goal misgeneralization.
- Score: 8.346033392846856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safe generalization in reinforcement learning requires not only that a learned policy acts capably in new situations, but also that it uses its capabilities towards the pursuit of the designer's intended goal. The latter requirement may fail when a proxy goal incentivizes similar behavior to the intended goal within the training environment, but not in novel deployment environments. This creates the risk that policies will behave as if in pursuit of the proxy goal, rather than the intended goal, in deployment -- a phenomenon known as goal misgeneralization. In this paper, we formalize this problem setting in order to theoretically study the possibility of goal misgeneralization under different training objectives. We show that goal misgeneralization is possible under approximate optimization of the maximum expected value (MEV) objective, but not the minimax expected regret (MMER) objective. We then empirically show that the standard MEV-based training method of domain randomization exhibits goal misgeneralization in procedurally-generated grid-world environments, whereas current regret-based unsupervised environment design (UED) methods are more robust to goal misgeneralization (though they don't find MMER policies in all cases). Our findings suggest that minimax expected regret is a promising approach to mitigating goal misgeneralization.
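For intuition, here is a minimal tabular sketch (my construction, not the paper's algorithm) contrasting the two objectives over a finite set of training levels: MEV picks the policy with the highest value averaged over the training level distribution, while MMER picks the policy whose worst-case regret (gap to the per-level optimal value) is smallest. The policies, levels, and numbers below are illustrative assumptions.

```python
# Toy contrast between maximum expected value (MEV) and minimax expected
# regret (MMER) over a finite set of levels. All numbers are illustrative.

# value[policy][level] = expected return of that policy on that level.
value = {
    "proxy_policy":    {"easy": 1.0, "hard": 0.0},  # proxy goal matches the
                                                    # intended goal only on "easy"
    "intended_policy": {"easy": 0.9, "hard": 0.9},  # pursues the intended goal
}
# Training distribution is dominated by levels where the proxy "works".
level_prob = {"easy": 0.95, "hard": 0.05}

# Per-level optimal value, which defines regret on each level.
optimal = {lvl: max(v[lvl] for v in value.values()) for lvl in level_prob}

def expected_value(policy):
    """MEV objective: value averaged over the training level distribution."""
    return sum(level_prob[lvl] * value[policy][lvl] for lvl in level_prob)

def max_regret(policy):
    """MMER objective: worst-case (over levels) gap to the per-level optimum."""
    return max(optimal[lvl] - value[policy][lvl] for lvl in level_prob)

mev_choice = max(value, key=expected_value)   # favours the proxy policy here
mmer_choice = min(value, key=max_regret)      # penalised for failing "hard"
print("MEV choice:", mev_choice, "| MMER choice:", mmer_choice)
```

In this toy setup MEV prefers the proxy-pursuing policy because the rare "hard" level barely affects the average, whereas MMER rejects it because that level exposes a large regret.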
Related papers
- Horizon Generalization in Reinforcement Learning [22.372738655730107]
We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. We show that this notion of horizon generalization is closely linked with invariance to planning. A policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Thus, such a policy trained to reach nearby goals should succeed at reaching arbitrarily-distant goals.
arXiv Detail & Related papers (2025-01-06T01:42:46Z)
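As a toy illustration of the planning-invariance property described in the entry above (my example, not that paper's code): in a deterministic 1-D corridor, a greedy shortest-path policy takes the same first action towards a distant goal as towards any waypoint lying on the way to it.

```python
# Toy check of planning invariance on a 1-D corridor: the greedy policy's
# action towards a far goal equals its action towards any waypoint en route.

def greedy_action(state: int, goal: int) -> int:
    """Move one step towards the goal (+1, -1, or 0 if already there)."""
    if goal > state:
        return +1
    if goal < state:
        return -1
    return 0

state, goal = 2, 10
for waypoint in range(state + 1, goal + 1):   # every waypoint between state and goal
    assert greedy_action(state, waypoint) == greedy_action(state, goal)
print("Action towards every en-route waypoint matches the action towards the goal.")
```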
- Towards Measuring Goal-Directedness in AI Systems [0.0]
A key prerequisite for AI systems pursuing unintended goals is whether they will behave in a coherent and goal-directed manner.
We propose a new family of definitions of the goal-directedness of a policy that analyze whether it is well-modeled as near-optimal for many reward functions.
Our contribution is a definition of goal-directedness that is simpler and more easily computable in order to approach the question of whether AI systems could pursue dangerous goals.
arXiv Detail & Related papers (2024-10-07T01:34:42Z)
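A minimal sketch in the spirit of the measure described in the entry above (my simplification, not that paper's exact definition): score a fixed action distribution by the fraction of randomly sampled reward functions under which it is near-optimal, so a sharply peaked policy scores higher than an aimless uniform one.

```python
import numpy as np

# Sketch: "goal-directedness" of an action distribution as the fraction of
# random reward functions for which it is near-optimal (illustrative only).

rng = np.random.default_rng(0)
n_actions = 5
peaked = np.array([0.9, 0.025, 0.025, 0.025, 0.025])   # mostly plays action 0

def near_optimal_fraction(policy, n_rewards=10_000, eps=0.1):
    count = 0
    for _ in range(n_rewards):
        reward = rng.uniform(0.0, 1.0, size=n_actions)  # random reward function
        achieved = policy @ reward                       # expected reward of policy
        if achieved >= (1.0 - eps) * reward.max():       # near-optimal for this reward?
            count += 1
    return count / n_rewards

uniform = np.full(n_actions, 1.0 / n_actions)            # maximally "aimless" policy
print("peaked policy :", near_optimal_fraction(peaked))
print("uniform policy:", near_optimal_fraction(uniform))
```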
- Imitating Graph-Based Planning with Goal-Conditioned Policies [72.61631088613048]
We present a self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy.
We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods.
arXiv Detail & Related papers (2023-03-20T14:51:10Z)
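One way to read the distillation step in the entry above, sketched with tabular softmax policies (a simplified stand-in, with a toy planner, not that paper's method): the goal-conditioned student is trained to imitate the action distribution the subgoal-conditioned teacher produces at a planner-chosen subgoal.

```python
import numpy as np

# Sketch of distilling a subgoal-conditioned policy into a goal-conditioned
# one by imitation. Teacher, planner, and sizes are toy stand-ins.

rng = np.random.default_rng(0)
n_states, n_goals, n_actions = 4, 4, 3

# Toy "teacher": a subgoal-conditioned action distribution table.
teacher = rng.dirichlet(np.ones(n_actions), size=(n_states, n_goals))

def plan_subgoal(state, goal):
    """Hypothetical planner: pick an intermediate subgoal (toy midpoint rule)."""
    return (state + goal) // 2

# Student: goal-conditioned softmax policy parameterised by logits.
logits = np.zeros((n_states, n_goals, n_actions))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Distillation: push the student's distribution towards the teacher's
# distribution at the planner's subgoal via gradient steps on cross-entropy.
for _ in range(2000):
    s, g = rng.integers(n_states), rng.integers(n_goals)
    target = teacher[s, plan_subgoal(s, g)]
    probs = softmax(logits[s, g])
    logits[s, g] -= 0.5 * (probs - target)   # grad of cross-entropy w.r.t. logits

s, g = 0, 3
print("teacher@subgoal:", teacher[s, plan_subgoal(s, g)].round(2))
print("student@goal   :", softmax(logits[s, g]).round(2))
```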
- Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning [99.38163119531745]
We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups.
We experimentally demonstrate improved expected return on out-of-distribution goals, while still allowing for specifying goals with expressive structure.
arXiv Detail & Related papers (2022-11-01T03:31:43Z)
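A minimal sketch of a discretizing bottleneck in the spirit of the entry above (illustrative, not that paper's architecture; the codebook here is random rather than learned): a continuous goal embedding is snapped to its nearest entry in a small codebook before being passed to the policy.

```python
import numpy as np

# Sketch of a discretizing (vector-quantization style) bottleneck for goal
# embeddings; illustrative only, not the paper's exact architecture.

rng = np.random.default_rng(0)
embed_dim, codebook_size = 8, 16
codebook = rng.normal(size=(codebook_size, embed_dim))   # stands in for learned codes

def discretize(goal_embedding: np.ndarray) -> np.ndarray:
    """Replace a continuous goal embedding by its nearest codebook entry."""
    distances = np.linalg.norm(codebook - goal_embedding, axis=1)
    return codebook[np.argmin(distances)]

goal_embedding = rng.normal(size=embed_dim)    # e.g. output of a goal encoder
quantized = discretize(goal_embedding)
print("quantized goal embedding:", quantized.round(2))
# The policy would then condition on `quantized` instead of the raw embedding.
```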
- On the Power of Pre-training for Generalization in RL: Provable Benefits and Hardness [47.09873295916592]
Generalization in Reinforcement Learning (RL) aims to train an agent that generalizes to the target environment.
This paper studies RL generalization from a theoretical aspect: how much can we expect pre-training over training environments to be helpful?
When the interaction with the target environment is not allowed, we certify that the best we can obtain is a near-optimal policy in an average sense, and we design an algorithm that achieves this goal.
arXiv Detail & Related papers (2022-10-19T10:58:24Z)
- Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals [21.055450435866028]
We show how an AI system may pursue an undesired goal even when the specification is correct.
Goal misgeneralization is a specific form of robustness failure for learning algorithms.
We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.
arXiv Detail & Related papers (2022-10-04T17:57:53Z)
- Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space [76.46113138484947]
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments.
To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach goals for a wide range of tasks on command.
We propose Planning to Practice, a method that makes it practical to train goal-conditioned policies for long-horizon tasks.
arXiv Detail & Related papers (2022-05-17T06:58:17Z)
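A schematic reading of the idea in the entry above (hypothetical function names on a toy 1-D task, not that paper's code): a high-level planner proposes a sequence of intermediate subgoals and a goal-conditioned policy is rolled out against each in turn, so a long-horizon task decomposes into short, reachable segments.

```python
# Schematic sketch of composing subgoals for a long-horizon task; the planner
# and policy below are hypothetical stand-ins, not the paper's implementation.

def plan_subgoals(start: int, final_goal: int, n_segments: int = 4) -> list[int]:
    """Hypothetical planner: evenly spaced intermediate subgoals on a 1-D task."""
    step = (final_goal - start) / n_segments
    return [round(start + step * (i + 1)) for i in range(n_segments)]

def goal_conditioned_step(state: int, subgoal: int) -> int:
    """Hypothetical low-level policy: move one unit towards the current subgoal."""
    return state + (1 if subgoal > state else -1 if subgoal < state else 0)

state, final_goal = 0, 12
for subgoal in plan_subgoals(state, final_goal):
    while state != subgoal:                      # execute each short segment
        state = goal_conditioned_step(state, subgoal)
print("reached final goal:", state == final_goal)
```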
- Generative multitask learning mitigates target-causing confounding [61.21582323566118]
We propose a simple and scalable approach to causal representation learning for multitask learning.
The improvement comes from mitigating unobserved confounders that cause the targets, but not the input.
Our results on the Attributes of People and Taskonomy datasets reflect the conceptual improvement in robustness to prior probability shift.
arXiv Detail & Related papers (2022-02-08T20:42:14Z)
- Adversarial Intrinsic Motivation for Reinforcement Learning [60.322878138199364]
We investigate whether the Wasserstein-1 distance between a policy's state visitation distribution and a target distribution can be utilized effectively for reinforcement learning tasks.
Our approach, termed Adversarial Intrinsic Motivation (AIM), estimates this Wasserstein-1 distance through its dual objective and uses it to compute a supplemental reward function.
arXiv Detail & Related papers (2021-05-27T17:51:34Z)
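As a simplified illustration of the quantity used in the entry above (not AIM's dual, adversarially trained estimator): on a 1-D state space one can compare a policy's empirical state-visitation samples to goal samples with SciPy's primal Wasserstein-1 estimate and use its negative as a supplemental reward signal. The distributions below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Simplified illustration of a Wasserstein-1-based shaping signal on a 1-D
# state space. AIM itself estimates W1 via its dual objective with a learned
# potential; here we just use SciPy's sample-based estimate.

rng = np.random.default_rng(0)
goal_samples = rng.normal(loc=10.0, scale=0.1, size=256)   # target distribution

def supplemental_reward(visited_states: np.ndarray) -> float:
    """Negative W1 distance between visitation samples and goal samples:
    the closer the visitation distribution is to the goal, the higher the reward."""
    return -wasserstein_distance(visited_states, goal_samples)

far_policy  = rng.normal(loc=0.0, scale=1.0, size=256)   # wanders far from the goal
near_policy = rng.normal(loc=9.0, scale=1.0, size=256)   # gets close to the goal
print("far policy reward :", round(supplemental_reward(far_policy), 2))
print("near policy reward:", round(supplemental_reward(near_policy), 2))
```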
- Automatic Curriculum Learning through Value Disagreement [95.19299356298876]
Continually solving new, unsolved tasks is the key to learning diverse behaviors.
In the multi-task domain, where an agent needs to reach multiple goals, the choice of training goals can largely affect sample efficiency.
We propose setting up an automatic curriculum for goals that the agent needs to solve.
We evaluate our method across 13 multi-goal robotic tasks and 5 navigation tasks, and demonstrate performance gains over current state-of-the-art methods.
arXiv Detail & Related papers (2020-06-17T03:58:25Z)
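A minimal sketch of the goal-selection idea in the entry above (my simplification, not that paper's implementation; the value ensemble below is random, standing in for learned critics): score candidate goals by the disagreement of an ensemble of value estimates and sample training goals in proportion to that disagreement, so the curriculum concentrates on goals at the frontier of the agent's competence.

```python
import numpy as np

# Sketch: pick training goals where an ensemble of value estimates disagrees
# most. The value ensemble is random here, standing in for learned critics.

rng = np.random.default_rng(0)
n_goals, ensemble_size = 20, 5

# value_estimates[k, g]: k-th critic's estimated success probability for goal g.
value_estimates = rng.uniform(0.0, 1.0, size=(ensemble_size, n_goals))

disagreement = value_estimates.std(axis=0)     # per-goal disagreement across critics
probs = disagreement / disagreement.sum()      # sample goals proportionally

curriculum_batch = rng.choice(n_goals, size=8, p=probs, replace=True)
print("sampled training goals:", sorted(curriculum_batch.tolist()))
```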
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.