Inferring Lexicographically-Ordered Rewards from Preferences
- URL: http://arxiv.org/abs/2202.10153v1
- Date: Mon, 21 Feb 2022 12:01:41 GMT
- Title: Inferring Lexicographically-Ordered Rewards from Preferences
- Authors: Alihan Hüyük, William R. Zame, Mihaela van der Schaar
- Abstract summary: This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences.
We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities.
- Score: 82.42854687952115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling the preferences of agents over a set of alternatives is a principal
concern in many areas. The dominant approach has been to find a single
reward/utility function with the property that alternatives yielding higher
rewards are preferred over alternatives yielding lower rewards. However, in
many settings, preferences are based on multiple, often competing, objectives;
a single reward function is not adequate to represent such preferences. This
paper proposes a method for inferring multi-objective reward-based
representations of an agent's observed preferences. We model the agent's
priorities over different objectives as entering lexicographically, so that
objectives with lower priorities matter only when the agent is indifferent with
respect to objectives with higher priorities. We offer two example applications
in healthcare, one inspired by cancer treatment, the other inspired by organ
transplantation, to illustrate how the lexicographically-ordered rewards we
learn can provide a better understanding of a decision-maker's preferences and
help improve policies when used in reinforcement learning.
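A minimal sketch of the lexicographic comparison described in the abstract: a lower-priority objective is consulted only when the agent is (approximately) indifferent on every higher-priority objective. The per-objective reward values and the `indifference_eps` tolerance below are illustrative assumptions, not the paper's inference procedure.

```python
def lexicographic_prefers(rewards_a, rewards_b, indifference_eps=1e-3):
    """Compare two alternatives by per-objective rewards, highest priority first.

    Returns +1 if a is preferred, -1 if b is preferred, 0 if indifferent.
    A lower-priority objective only matters when the agent is (approximately)
    indifferent with respect to every higher-priority objective.
    """
    for r_a, r_b in zip(rewards_a, rewards_b):
        if abs(r_a - r_b) > indifference_eps:  # decisive at this priority level
            return 1 if r_a > r_b else -1
        # indifferent at this level: fall through to the next objective
    return 0

# Objective 0 ties within the tolerance, so objective 1 decides.
print(lexicographic_prefers([0.9000, 0.2], [0.9005, 0.7]))  # -1: b is preferred
```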
Related papers
- Exploiting Preferences in Loss Functions for Sequential Recommendation via Weak Transitivity [4.7894654945375175]
The choice of optimization objective is pivotal in the design of a recommender system.
We propose a method that extends the original objectives to explicitly leverage different levels of preference as relative orders between their scores.
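A hedged sketch of the general idea of ranking items by ordered preference levels; the softplus pairwise loss and the purchase/click/impression levels below are illustrative assumptions, not necessarily the paper's formulation.

```python
import numpy as np

def multilevel_pairwise_loss(scores, levels):
    """Toy pairwise objective over ordered preference levels.

    For every item pair (i, j) with levels[i] > levels[j], penalize
    scores[i] not exceeding scores[j] with a BPR-style softplus term,
    so relative orders between preference levels shape the objective.
    """
    scores, levels = np.asarray(scores, dtype=float), np.asarray(levels)
    diff = scores[:, None] - scores[None, :]      # diff[i, j] = s_i - s_j
    ordered = levels[:, None] > levels[None, :]   # pairs with a strict level order
    return float(np.mean(np.log1p(np.exp(-diff[ordered]))))  # -log sigmoid(s_i - s_j)

# Hypothetical levels: purchase (2) > click (1) > impression (0).
print(multilevel_pairwise_loss([2.0, 1.0, -0.5], [2, 1, 0]))
```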
arXiv Detail & Related papers (2024-08-01T06:55:19Z)
- Incentivized Learning in Principal-Agent Bandit Games [62.41639598376539]
This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent.
The principal can influence the agent's decisions by offering incentives that are added to the agent's rewards.
We present nearly optimal learning algorithms for the principal's regret in both multi-armed and linear contextual settings.
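A minimal illustration of the incentive mechanism described above, assuming (purely for this sketch) a greedy agent with known mean rewards; in the paper's setting the principal must learn these quantities from repeated interaction.

```python
import numpy as np

agent_means = np.array([0.6, 0.4, 0.3])  # agent's expected rewards per arm (hypothetical)
principal_target = 1                     # the arm the principal would like chosen

def agent_choice(incentives):
    # The agent greedily picks the arm maximizing its own reward plus the incentive.
    return int(np.argmax(agent_means + incentives))

# With no incentives the agent plays arm 0; a just-large-enough transfer flips it.
incentives = np.zeros(3)
incentives[principal_target] = (agent_means[0] - agent_means[principal_target]) + 0.01
print(agent_choice(np.zeros(3)), agent_choice(incentives))  # 0, then 1
```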
arXiv Detail & Related papers (2024-03-06T16:00:46Z)
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
- Consistent Aggregation of Objectives with Diverse Time Preferences Requires Non-Markovian Rewards [7.9456318392035845]
It is shown that Markovian aggregation of reward functions is not possible when the time preference for each objective may vary.
It follows that optimal multi-objective agents must admit rewards that are non-Markovian with respect to the individual objectives.
This work offers new insights into sequential, multi-objective agency and intertemporal choice, and has practical implications for the design of AI systems deployed to serve multiple generations of principals with varying time preferences.
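A tiny numeric illustration of why differing time preferences break Markovian aggregation: with per-objective discount factors (the values below are arbitrary assumptions), the relative weight of the more impatient objective shrinks with time, so the trade-off to be made in the same Markov state depends on history.

```python
# Hypothetical discount factors for two objectives.
g1, g2 = 0.99, 0.5
for t in (0, 5, 20):
    # Relative weight of objective 2 versus objective 1 after t steps.
    print(t, (g2 / g1) ** t)
# 0 -> 1.0, 5 -> ~0.033, 20 -> ~1.2e-6: the trade-off is time-dependent,
# which a single state-only (Markovian) reward cannot encode.
```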
arXiv Detail & Related papers (2023-09-30T17:06:34Z)
- Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of the principal's incentives and the agent's choices.
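A hedged sketch of why such a history is informative, assuming (for illustration only) an agent that greedily maximizes its mean reward plus the offered incentive; each observed choice then yields linear inequalities on the unobserved means. The paper's estimator and agent model may differ.

```python
def choice_constraints(incentives, chosen, n_arms):
    """Constraints implied by one round, under the greedy-agent assumption.

    If the agent picks argmax(mu + incentives), then for every other arm j:
        mu[chosen] - mu[j] >= incentives[j] - incentives[chosen].
    Accumulating these bounds over the history constrains mu without ever
    observing the agent's reward realizations.
    """
    return [(chosen, j, incentives[j] - incentives[chosen])
            for j in range(n_arms) if j != chosen]

print(choice_constraints([0.0, 0.3, 0.0], chosen=1, n_arms=3))
# [(1, 0, -0.3), (1, 2, -0.3)]: mu[1] - mu[0] >= -0.3 and mu[1] - mu[2] >= -0.3
```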
arXiv Detail & Related papers (2023-08-13T08:12:01Z)
- Multi-Target Multiplicity: Flexibility and Fairness in Target Specification under Resource Constraints [76.84999501420938]
We introduce a conceptual and computational framework for assessing how the choice of target affects individuals' outcomes.
We show that the level of multiplicity that stems from target variable choice can be greater than that stemming from nearly-optimal models of a single target.
arXiv Detail & Related papers (2023-06-23T18:57:14Z)
- Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments.
Such preferences are commonly assumed to be determined by each segment's partial return, i.e. its sum of rewards; we find this assumption to be flawed and instead propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and learns reward functions from them that lead to better human-aligned policies.
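A hedged sketch contrasting the two per-segment statistics inside a Bradley-Terry-style preference likelihood; the logistic link and the exact regret definition below are assumptions for illustration, not the paper's precise model.

```python
import numpy as np

def pref_prob(stat_a, stat_b):
    """Bradley-Terry-style probability that segment A is preferred to segment B."""
    return 1.0 / (1.0 + np.exp(-(stat_a - stat_b)))

def partial_return(rewards):
    # Common assumption: preferences follow the summed reward of a segment.
    return float(np.sum(rewards))

def neg_regret(v_star, q_star):
    # Alternative statistic: how close each chosen action was to optimal,
    # i.e. the negated sum of per-step regrets V*(s_t) - Q*(s_t, a_t).
    return -float(np.sum(np.asarray(v_star) - np.asarray(q_star)))

# A segment with low return but near-optimal actions can be preferred under the
# regret statistic even when the partial-return statistic would rank it lower.
```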
arXiv Detail & Related papers (2022-06-05T17:58:02Z)
- An AGM Approach to Revising Preferences [7.99536002595393]
We look at preference change arising out of an interaction between two elements: the first is an initial preference ranking encoding a pre-existing attitude; the second is new preference information signaling input from an authoritative source.
The aim is to adjust the initial preference and bring it in line with the new preference, without having to give up more information than necessary.
We model this process using the formal machinery of belief change, along the lines of the well-known AGM approach.
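A loose sketch of the flavor of such a revision, assuming preferences are a simple strict ranking and the new information is a single pair to be enforced; the paper's AGM-style operators and rationality postulates are far more general than this toy.

```python
def revise(ranking, constraint):
    """Adjust an initial ranking (best first) so the new constraint (x, y),
    read as 'x is now preferred to y', holds, moving only x if needed."""
    x, y = constraint
    if ranking.index(x) < ranking.index(y):
        return list(ranking)                 # already consistent: change nothing
    revised = [item for item in ranking if item != x]
    revised.insert(revised.index(y), x)      # move x to sit just above y
    return revised

print(revise(["a", "b", "c"], ("c", "a")))   # ['c', 'a', 'b']
```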
arXiv Detail & Related papers (2021-12-28T18:12:57Z)
- Incentivizing Exploration with Selective Data Disclosure [94.32975679779491]
We propose and design recommendation systems that incentivize efficient exploration.
Agents arrive sequentially, choose actions and receive rewards, drawn from fixed but unknown action-specific distributions.
We attain the optimal regret rate for exploration using a flexible frequentist behavioral model.
arXiv Detail & Related papers (2018-11-14T19:29:16Z)