Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards
- URL: http://arxiv.org/abs/2308.06717v1
- Date: Sun, 13 Aug 2023 08:12:01 GMT
- Title: Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards
- Authors: Ilgin Dogan, Zuo-Jun Max Shen, Anil Aswani
- Abstract summary: In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of the principal's incentives and the agent's choices.
- Score: 4.742123770879715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In practice, incentive providers (i.e., principals) often cannot observe the reward realizations of incentivized agents, in contrast to many previously studied principal-agent models. This information asymmetry challenges the principal to consistently estimate the agent's unknown rewards by solely watching the agent's decisions, which becomes even more challenging when the agent has to learn its own rewards. This complex setting is observed in various real-life scenarios ranging from renewable energy storage contracts to personalized healthcare incentives. Hence, it offers not only interesting theoretical questions but also wide practical relevance. This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal. The agent tackles a multi-armed bandit (MAB) problem to maximize its expected reward plus incentive. On top of the agent's learning, the principal trains a parallel algorithm and faces a trade-off between consistently estimating the agent's unknown rewards and maximizing its own utility by offering adaptive incentives to lead the agent. For a non-parametric model, we introduce an estimator whose only input is the history of the principal's incentives and the agent's choices. We unite this estimator with a proposed data-driven incentive policy within a MAB framework. Without restricting the type of the agent's algorithm, we prove finite-sample consistency of the estimator and a rigorous regret bound for the principal by considering the sequential externality imposed by the agent. Lastly, our theoretical results are reinforced by simulations justifying the applicability of our framework to green energy aggregator contracts.
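
To make the information structure concrete, below is a minimal simulation sketch of the repeated game described in the abstract. Only the setup is taken from the paper: the agent solves a MAB problem on hidden rewards plus incentives, while the principal observes nothing but its own incentives and the agent's choices. Everything else is an assumption for illustration: the agent is modeled as UCB1, the principal offers random single-arm incentives, and the final estimator is a crude switch-threshold stand-in, not the paper's non-parametric estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
true_means = np.array([0.7, 0.5, 0.2])  # hidden from the principal

class UCB1Agent:
    """Hypothetical agent model: UCB1 on (estimated reward + incentive)."""
    def __init__(self, k):
        self.counts = np.zeros(k)
        self.values = np.zeros(k)  # running mean of observed rewards
    def choose(self, incentives, t):
        if (self.counts == 0).any():
            return int(np.argmin(self.counts))  # play every arm once
        bonus = np.sqrt(2.0 * np.log(t + 1) / self.counts)
        return int(np.argmax(self.values + incentives + bonus))
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

agent = UCB1Agent(K)
history = []  # the principal's only data: (incentive vector, chosen arm)

for t in range(20000):
    # Hypothetical incentive policy: a random bonus on one random arm.
    incentives = np.zeros(K)
    target = int(rng.integers(K))
    incentives[target] = float(rng.uniform(0.0, 1.0))

    arm = agent.choose(incentives, t)
    reward = float(rng.binomial(1, true_means[arm]))  # seen by agent only
    agent.update(arm, reward)
    history.append((incentives, arm))

# Stand-in estimator (not the paper's): approximate mu_i - mu_j by the
# smallest incentive on arm j that pulled the agent away from arm i,
# using only the late history, after the agent has mostly learned.
# Exploration noise makes this crude; the paper's estimator is rigorous.
switch_gap = np.full((K, K), np.inf)
prev_arm = history[len(history) // 2][1]
for incentives, arm in history[len(history) // 2 + 1:]:
    if arm != prev_arm and incentives[arm] > 0:
        switch_gap[prev_arm, arm] = min(switch_gap[prev_arm, arm],
                                        incentives[arm])
    prev_arm = arm

print("hidden reward gaps mu_0 - mu_j:", true_means[0] - true_means[1:])
print("estimated gaps (stand-in)    :", switch_gap[0, 1:])
```

The point of the sketch is the data flow rather than the estimator itself: the `history` list is exactly the input that the paper's estimator is allowed to see.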
Related papers
- Multi-Agent Imitation Learning: Value is Easy, Regret is Hard [52.31989962031179]
We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents.
Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations.
While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents.
arXiv Detail & Related papers (2024-06-06T16:18:20Z)
- Incentivized Learning in Principal-Agent Bandit Games [62.41639598376539]
This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent.
The principal can influence the agent's decisions by offering incentives that are added to the agent's rewards.
We present nearly optimal learning algorithms for the principal's regret in both multi-armed and linear contextual settings.
arXiv Detail & Related papers (2024-03-06T16:00:46Z)
- Principal-Agent Reward Shaping in MDPs [50.914110302917756]
Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest.
We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players.
Our results establish trees and deterministic decision processes with a finite horizon as tractable cases.
arXiv Detail & Related papers (2023-12-30T18:30:44Z)
- Repeated Principal-Agent Games with Unobserved Agent Rewards and Perfect-Knowledge Agents [5.773269033551628]
We study a scenario of repeated principal-agent games within a multi-armed bandit (MAB) framework.
We design our policy by first constructing an estimator for the agent's expected reward for each bandit arm.
We conclude with numerical simulations demonstrating the applicability of our policy to a real-life setting from collaborative transportation planning.
arXiv Detail & Related papers (2023-04-14T21:57:16Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training; a schematic sketch of the aggregation step appears after this list.
The superiority of DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
- Learning to Incentivize Other Learning Agents [73.03133692589532]
We show how to equip RL agents with the ability to give rewards directly to other agents, using a learned incentive function.
Such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games.
Our work points toward more opportunities and challenges along the path to ensure the common good in a multi-agent future.
arXiv Detail & Related papers (2020-06-10T20:12:38Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
- Incentivizing Exploration with Selective Data Disclosure [70.11902902106014]
We propose and design recommendation systems that incentivize efficient exploration.
Agents arrive sequentially, choose actions and receive rewards, drawn from fixed but unknown action-specific distributions.
We attain the optimal regret rate for exploration using a flexible frequentist behavioral model.
arXiv Detail & Related papers (2018-11-14T19:29:16Z)
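
As referenced in the DRE-MARL entry above, the following is a schematic sketch of what "policy-weighted reward aggregation" could look like, read directly from that one-line summary; the function name and shapes are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def policy_weighted_reward(reward_branches: np.ndarray,
                           policy_probs: np.ndarray) -> float:
    """Aggregate per-action reward estimates (one "branch" per action)
    by weighting each with the current policy's action probability."""
    assert reward_branches.shape == policy_probs.shape
    return float(np.dot(policy_probs, reward_branches))

# Hypothetical example: one agent with 4 discrete actions.
r_hat = np.array([0.1, 0.8, 0.3, 0.5])  # estimated reward per branch
pi = np.array([0.1, 0.6, 0.1, 0.2])     # current policy distribution

# Using the expectation under the policy instead of a single sampled
# reward can reduce the variance of the training signal.
print(policy_weighted_reward(r_hat, pi))  # -> 0.62
```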