Related papers: Goal Alignment: A Human-Aware Account of Value Alignment Problem

URL: http://arxiv.org/abs/2302.00813v1
Date: Thu, 2 Feb 2023 01:18:57 GMT
Title: Goal Alignment: A Human-Aware Account of Value Alignment Problem
Authors: Malek Mechergui and Sarath Sreedharan
Abstract summary: Value alignment problems arise in scenarios where the specified objectives of an AI agent don't match the true underlying objective of its users. A foundational cause for misalignment is the inherent asymmetry in human expectations about the agent's behavior and the behavior generated by the agent for the specified objective. We propose a novel formulation for the value alignment problem, named goal alignment, that focuses on a few central challenges related to value alignment.
Score: 16.660807368368758
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Value alignment problems arise in scenarios where the specified objectives of an AI agent don't match the true underlying objective of its users. The problem has been widely argued to be one of the central safety problems in AI. Unfortunately, most existing works in value alignment tend to focus on issues that are primarily related to the fact that reward functions are an unintuitive mechanism to specify objectives. However, the complexity of the objective specification mechanism is just one of many reasons why the user may have misspecified their objective. A foundational cause for misalignment that is being overlooked by these works is the inherent asymmetry in human expectations about the agent's behavior and the behavior generated by the agent for the specified objective. To address this lacuna, we propose a novel formulation for the value alignment problem, named goal alignment, that focuses on a few central challenges related to value alignment. In doing so, we bridge the currently disparate research areas of value alignment and human-aware planning. Additionally, we propose a first-of-its-kind interactive algorithm that is capable of using information generated under incorrect beliefs about the agent to determine the true underlying goal of the user.

Related papers

Towards Measuring Goal-Directedness in AI Systems [0.0]
A key prerequisite for AI systems pursuing unintended goals is whether they will behave in a coherent and goal-directed manner. We propose a new family of definitions of the goal-directedness of a policy that analyze whether it is well-modeled as near-optimal for many reward functions. Our contribution is a definition of goal-directedness that is simpler and more easily computable in order to approach the question of whether AI systems could pursue dangerous goals.
arXiv Detail & Related papers (2024-10-07T01:34:42Z)
Quantifying Misalignment Between Agents: Towards a Sociotechnical Understanding of Alignment [2.619545850602691]
Recent sociotechnical approaches highlight the need to understand complex misalignment among multiple human and AI agents. We adapt a computational social science model of human contention to the alignment problem. Our model quantifies misalignment in large, diverse agent groups with potentially conflicting goals.
arXiv Detail & Related papers (2024-06-06T16:31:22Z)
Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch [19.03141646688652]
We use the theory of mind, i.e., the human user's beliefs about the AI agent, as a basis to develop a formal explanatory framework. We propose a new interactive algorithm that uses the specified reward to infer potential user expectations.
arXiv Detail & Related papers (2024-04-12T19:43:37Z)
Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values. Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework. In these games, the agent solves a partially observable Markov decision process. We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
arXiv Detail & Related papers (2022-06-23T16:36:13Z)
Aligned with Whom? Direct and social goals for AI systems [0.0]
This article distinguishes two types of alignment problems depending on whose goals we consider. The direct alignment problem considers whether an AI system accomplishes the goals of the entity operating it. The social alignment problem considers the effects of an AI system on larger groups or on society more broadly.
arXiv Detail & Related papers (2022-05-09T13:49:47Z)
Generative multitask learning mitigates target-causing confounding [61.21582323566118]
We propose a simple and scalable approach to causal representation learning for multitask learning. The improvement comes from mitigating unobserved confounders that cause the targets, but not the input. Our results on the Attributes of People and Taskonomy datasets reflect the conceptual improvement in robustness to prior probability shift.
arXiv Detail & Related papers (2022-02-08T20:42:14Z)
Understanding the origin of information-seeking exploration in probabilistic objectives for control [62.997667081978825]
An exploration-exploitation trade-off is central to the description of adaptive behaviour. One approach to solving this trade-off has been to equip or propose that agents possess an intrinsic 'exploratory drive'. We show that this combination of utility maximizing and information-seeking behaviour arises from the minimization of an entirely different class of objectives.
arXiv Detail & Related papers (2021-03-11T18:42:39Z)
Offline Contextual Bandits with Overparameterized Models [52.788628474552276]
We ask whether the same phenomenon occurs for offline contextual bandits. We show that this discrepancy is due to the action-stability of their objectives. In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
arXiv Detail & Related papers (2020-06-27T13:52:07Z)
ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects [119.46959413000594]
This document summarizes the consensus recommendations of a working group on ObjectNav. We make recommendations on subtle but important details of evaluation criteria. We provide a detailed description of the instantiation of these recommendations in challenges organized at the Embodied AI workshop at CVPR 2020.
arXiv Detail & Related papers (2020-06-23T17:18:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.