Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
- URL: http://arxiv.org/abs/2210.01790v1
- Date: Tue, 4 Oct 2022 17:57:53 GMT
- Title: Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
- Authors: Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton
- Abstract summary: We show how an AI system may pursue an undesired goal even when the specification is correct.
Goal misgeneralization is a specific form of robustness failure for learning algorithms.
We suggest several research directions that could reduce the risk of goal misgeneralization for future systems.
- Score: 21.055450435866028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of AI alignment is concerned with AI systems that pursue unintended
goals. One commonly studied mechanism by which an unintended goal might arise
is specification gaming, in which the designer-provided specification is flawed
in a way that the designers did not foresee. However, an AI system may pursue
an undesired goal even when the specification is correct, in the case of goal
misgeneralization. Goal misgeneralization is a specific form of robustness
failure for learning algorithms in which the learned program competently
pursues an undesired goal that leads to good performance in training situations
but bad performance in novel test situations. We demonstrate that goal
misgeneralization can occur in practical systems by providing several examples
in deep learning systems across a variety of domains. Extrapolating forward to
more capable systems, we provide hypotheticals that illustrate how goal
misgeneralization could lead to catastrophic risk. We suggest several research
directions that could reduce the risk of goal misgeneralization for future
systems.
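To make the failure mode concrete, here is a minimal, hypothetical sketch in the spirit of the paper's examples (not code from the paper): the reward specification is correct throughout, yet the learned policy pursues a proxy goal that coincides with the true goal only in training.

```python
# A hypothetical one-dimensional illustration of goal misgeneralization.
# During training, the proxy goal "move right" and the true goal "reach the
# coin" coincide, so a policy that learned the proxy scores perfectly. At
# test time the coin moves, and the proxy-pursuing policy still acts
# competently -- it just competently does the wrong thing.

def run_episode(policy, coin_pos, length=10):
    """Walk a 1-D corridor; reward 1 if the agent ends on the coin."""
    pos = 0
    for _ in range(length):
        pos = max(0, min(length, pos + policy(pos)))
    return 1.0 if pos == coin_pos else 0.0

# The learned policy internalized the proxy goal: always move right.
go_right = lambda pos: +1

# Training distribution: the coin always sits at the far right (pos 10).
train_return = run_episode(go_right, coin_pos=10)

# Test distribution: the coin is placed mid-corridor (pos 3).
test_return = run_episode(go_right, coin_pos=3)

print(f"train return: {train_return}, test return: {test_return}")
# -> train return: 1.0, test return: 0.0
```

Nothing about the reward was mis-specified; the training distribution simply never distinguished the intended goal from the proxy.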
Related papers
- Towards Measuring Goal-Directedness in AI Systems [0.0]
A key prerequisite for AI systems to pursue unintended goals is that they behave in a coherent, goal-directed manner.
We propose a new family of definitions of the goal-directedness of a policy that analyze whether it is well-modeled as near-optimal for many reward functions.
Our contribution is a simpler, more easily computable definition of goal-directedness, as a step toward assessing whether AI systems could pursue dangerous goals.
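As a hedged reading of this idea (illustrative names, not the authors' formalism), one could score a policy by its best optimality ratio across a family of candidate reward functions:

```python
# Sketch: a policy is goal-directed to the extent it is near-optimal for
# *some* reward function in a candidate family. All names are illustrative.
import numpy as np

def goal_directedness(policy_returns, optimal_returns):
    """policy_returns[i]: return the policy achieves under reward function i.
    optimal_returns[i]: return of an optimal policy under reward function i.
    Score the policy by its best-case optimality ratio across rewards."""
    ratios = np.asarray(policy_returns) / np.asarray(optimal_returns)
    return ratios.max()

# A policy that is 95% optimal for at least one candidate reward looks
# highly goal-directed; one that is mediocre for every reward does not.
print(goal_directedness([0.2, 0.95, 0.1], [1.0, 1.0, 1.0]))  # -> 0.95
print(goal_directedness([0.3, 0.25, 0.2], [1.0, 1.0, 1.0]))  # -> 0.3
```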
arXiv Detail & Related papers (2024-10-07T01:34:42Z)
- Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch [19.03141646688652]
We use the theory of mind, i.e., the human user's beliefs about the AI agent, as a basis to develop a formal explanatory framework.
We propose a new interactive algorithm that uses the specified reward to infer potential user expectations.
arXiv Detail & Related papers (2024-04-12T19:43:37Z)
- Foundation Policies with Hilbert Representations [54.44869979017766]
We propose an unsupervised framework to pre-train generalist policies from unlabeled offline data.
Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment.
Our experiments show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion.
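One way to read "a structured representation that preserves temporal structure" is an embedding whose latent distances approximate temporal distances between states. The sketch below is an assumption-laden illustration of that idea, not the authors' implementation:

```python
# Sketch: learn a representation phi whose latent distances track the number
# of environment steps between states, so goal-reaching reduces to moving
# toward the goal's embedding in latent space. Shapes and sizes are assumed.
import torch
import torch.nn as nn

obs_dim = 8  # illustrative observation size
phi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 32))

def temporal_distance_loss(obs, future_obs, steps):
    """Push ||phi(s) - phi(s')|| toward the number of environment steps
    separating s and s' along the trajectory."""
    latent_dist = (phi(obs) - phi(future_obs)).norm(dim=-1)
    return ((latent_dist - steps) ** 2).mean()

obs = torch.randn(4, obs_dim)
future = torch.randn(4, obs_dim)
steps = torch.tensor([1.0, 5.0, 10.0, 2.0])
temporal_distance_loss(obs, future, steps).backward()
```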
arXiv Detail & Related papers (2024-02-23T19:09:10Z)
- AI Deception: A Survey of Examples, Risks, and Potential Solutions [20.84424818447696]
This paper argues that a range of current AI systems have learned how to deceive humans.
We define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth.
arXiv Detail & Related papers (2023-08-28T17:59:35Z)
- Interpretable Self-Aware Neural Networks for Robust Trajectory Prediction [50.79827516897913]
We introduce an interpretable paradigm for trajectory prediction that distributes the uncertainty among semantic concepts.
We validate our approach on real-world autonomous driving data, demonstrating superior performance over state-of-the-art baselines.
arXiv Detail & Related papers (2022-11-16T06:28:20Z)
- Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning [99.38163119531745]
We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups.
We experimentally show gains in expected return on out-of-distribution goals, while still allowing goals to be specified with expressive structure.
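A common way to realize such a discretizing bottleneck is vector quantization; the following is a minimal illustrative sketch (assumed details, not the paper's code):

```python
# Sketch: snap continuous goal embeddings to the nearest entry of a learned
# codebook, giving a discrete goal representation for the policy to
# condition on. Codebook size and dimensions are illustrative.
import torch

codebook = torch.randn(64, 32)  # 64 discrete codes of dimension 32
# (in training this would be an nn.Parameter, updated alongside the encoder)

def discretize(goal_embedding):
    """Replace each goal embedding (batch, 32) with its nearest codebook
    entry, using the straight-through trick so gradients still reach the
    encoder."""
    dists = torch.cdist(goal_embedding, codebook)   # (batch, 64)
    codes = codebook[dists.argmin(dim=-1)]          # nearest code per row
    return goal_embedding + (codes - goal_embedding).detach()
```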
arXiv Detail & Related papers (2022-11-01T03:31:43Z)
- Generalizing in the Real World with Representation Learning [1.3494312389622642]
Machine learning (ML) formalizes the problem of getting computers to learn from experience as optimization of performance according to some metric(s).
This is in contrast to requiring behaviour to be specified in advance (e.g. by hard-coded rules).
In this thesis I cover some of my work towards better understanding deep net generalization, identify several ways assumptions and problem settings fail to generalize to the real world, and propose ways to address those failures in practice.
arXiv Detail & Related papers (2022-10-18T15:11:09Z)
- Control-Aware Prediction Objectives for Autonomous Driving [78.19515972466063]
We present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner to be differentiable.
We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories.
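Schematically, the second weighting scheme might look like the following sketch, where the weights and planner interface are illustrative assumptions rather than the paper's exact formulation:

```python
# Sketch: weight each agent's prediction loss by how much the plan would
# change if that agent's predicted trajectory were swapped for ground truth.
import numpy as np

def capo_loss(nll_per_agent, plan_with_pred, plan_with_truth):
    """nll_per_agent: predictive negative log-likelihood per nearby agent,
    shape (num_agents,). plan_*: flattened planner control sequences under
    predicted vs. ground-truth trajectories, shape (num_agents, plan_dim)."""
    # Control variation: agents whose prediction errors change the plan
    # most get the largest weights.
    weights = np.linalg.norm(plan_with_pred - plan_with_truth, axis=-1)
    weights = weights / (weights.sum() + 1e-8)
    return float((weights * nll_per_agent).sum())
```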
arXiv Detail & Related papers (2022-04-28T07:37:21Z)
- Goal-Aware Cross-Entropy for Multi-Target Reinforcement Learning [15.33496710690063]
We propose a goal-aware cross-entropy (GACE) loss that can be utilized in a self-supervised way.
We then devise goal-discriminative attention networks (GDAN) which utilize the goal-relevant information to focus on the given instruction.
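Read as an auxiliary classification objective, a GACE-style loss might be sketched as follows (a hedged interpretation of the summary; names are illustrative):

```python
# Sketch: train encoder features to classify which goal / instruction was
# active when the observation was collected. The labels come for free from
# data collection, so the objective is self-supervised, and the resulting
# goal-discriminative features can feed an attention module.
import torch.nn.functional as F

def gace_loss(features, goal_head, goal_ids):
    """features: encoder outputs for observations gathered while pursuing
    known goals; goal_head: linear classifier over goals; goal_ids: index
    of the goal active in each observation (long tensor)."""
    logits = goal_head(features)  # (batch, num_goals)
    return F.cross_entropy(logits, goal_ids)
```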
arXiv Detail & Related papers (2021-10-25T14:24:39Z)
- Multi Agent System for Machine Learning Under Uncertainty in Cyber Physical Manufacturing System [78.60415450507706]
Recent advancements in predictive machine learning have led to its application in various use cases in manufacturing.
Most research has focused on maximising predictive accuracy without addressing the uncertainty associated with it.
In this paper, we determine the sources of uncertainty in machine learning and establish the success criteria of a machine learning system to function well under uncertainty.
arXiv Detail & Related papers (2021-07-28T10:28:05Z)
- Automatic Curriculum Learning through Value Disagreement [95.19299356298876]
Continually solving new, unsolved tasks is the key to learning diverse behaviors.
In the multi-task domain, where an agent needs to reach multiple goals, the choice of training goals can largely affect sample efficiency.
We propose setting up an automatic curriculum for goals that the agent needs to solve.
We evaluate our method across 13 multi-goal robotic tasks and 5 navigation tasks, and demonstrate performance gains over current state-of-the-art methods.
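A minimal sketch of such a curriculum (assumed details, not the authors' code): train an ensemble of goal-conditioned value functions and sample training goals in proportion to ensemble disagreement, which tends to peak at the frontier of the agent's competence:

```python
# Sketch: goals where the value ensemble disagrees most are neither trivially
# solved (all members agree: high value) nor hopeless (all agree: low value),
# so they make good curriculum targets.
import numpy as np

def sample_goal(candidate_goals, ensemble_values, temperature=1.0):
    """ensemble_values: array of shape (num_members, num_goals) holding each
    member's value estimate per goal. Sample a goal with probability
    proportional to the ensemble's standard deviation."""
    disagreement = np.std(ensemble_values, axis=0) + 1e-8  # (num_goals,)
    probs = disagreement ** (1.0 / temperature)
    probs = probs / probs.sum()
    return candidate_goals[np.random.choice(len(candidate_goals), p=probs)]
```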
arXiv Detail & Related papers (2020-06-17T03:58:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.