Repeated Principal-Agent Games with Unobserved Agent Rewards and
Perfect-Knowledge Agents
- URL: http://arxiv.org/abs/2304.07407v2
- Date: Sun, 7 May 2023 19:30:01 GMT
- Title: Repeated Principal-Agent Games with Unobserved Agent Rewards and
Perfect-Knowledge Agents
- Authors: Ilgin Dogan, Zuo-Jun Max Shen, and Anil Aswani
- Abstract summary: We study a scenario of repeated principal-agent games within a multi-armed bandit (MAB) framework.
We design our policy by first constructing an estimator for the agent's expected reward for each bandit arm.
We conclude with numerical simulations demonstrating the applicability of our policy to a real-life setting from collaborative transportation planning.
- Score: 5.773269033551628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by a number of real-world applications from domains like healthcare
and sustainable transportation, in this paper we study a scenario of repeated
principal-agent games within a multi-armed bandit (MAB) framework, where: the
principal gives a different incentive for each bandit arm, the agent picks a
bandit arm to maximize its own expected reward plus incentive, and the
principal observes which arm is chosen and receives a reward (different than
that of the agent) for the chosen arm. Designing policies for the principal is
challenging because the principal cannot directly observe the reward that the
agent receives for their chosen actions, and so the principal cannot directly
learn the expected reward using existing estimation techniques. As a result,
the problem of designing policies for this scenario, as well as similar ones,
remains mostly unexplored. In this paper, we construct a policy that achieves a
low regret (i.e., square-root regret up to a log factor) in this scenario for
the case where the agent has perfect-knowledge about its own expected rewards
for each bandit arm. We design our policy by first constructing an estimator
for the agent's expected reward for each bandit arm. Since our estimator uses
as data the sequence of incentives offered and subsequently chosen arms, the
principal's estimation can be regarded as an analogue of online inverse
optimization in MABs. Next, we construct a policy that we prove achieves a low
regret by deriving finite-sample concentration bounds for our estimator. We
conclude with numerical simulations demonstrating the applicability of our
policy to a real-life setting from collaborative transportation planning.
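To make the interaction concrete, below is a minimal Python sketch of the setting together with a threshold-search estimator in the spirit of the paper's online-inverse-optimization idea. Everything here is an illustrative assumption, not the authors' method: the agent is deterministic with perfect knowledge of its mean rewards `mu_agent`, the principal's mean rewards `mu_principal` are taken as known (in the paper they must also be learned from noisy observations), and the bisection is a simplification of the actual estimator and policy.

```python
import numpy as np

K = 3
mu_agent = np.array([0.6, 0.3, 0.8])      # agent's mean rewards, hidden from the principal
mu_principal = np.array([0.9, 0.5, 0.2])  # principal's mean rewards (assumed known here)

def agent_choice(incentives):
    # Perfect-knowledge agent: picks the arm maximizing its own
    # expected reward plus the offered incentive.
    return int(np.argmax(mu_agent + incentives))

def steering_cost(arm, hi=2.0, tol=1e-4):
    # Bisection for the smallest payment on `arm` that makes the agent
    # choose it, i.e. an estimate of max_j mu_agent[j] - mu_agent[arm].
    # Only the offers and the agent's observed choices are used -- never
    # mu_agent itself -- which is the online-inverse-optimization flavor
    # of the paper's estimator.
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        offer = np.zeros(K)
        offer[arm] = mid
        if agent_choice(offer) == arm:
            hi = mid   # offer sufficed; try paying less
        else:
            lo = mid   # offer too small; pay more
    return hi

costs = np.array([steering_cost(a) for a in range(K)])
net = mu_principal - costs            # principal's reward net of incentive paid
best = int(np.argmax(net))
print(f"estimated steering costs: {np.round(costs, 3)}; principal targets arm {best}")
```

Under these toy numbers the principal steers the agent to arm 0 by paying just above the agent's 0.2 utility gap; the paper's actual policy must additionally handle noisy principal rewards, and its square-root regret guarantee rests on finite-sample concentration bounds for the estimator.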
Related papers
- Stochastic Bandits for Egalitarian Assignment [58.33714486693828]
We study EgalMAB, an egalitarian assignment problem in the context of multi-armed bandits.
We design and analyze a UCB-based policy EgalUCB and establish upper bounds on the cumulative regret.
arXiv Detail & Related papers (2024-10-08T09:49:47Z)
- Incentivized Learning in Principal-Agent Bandit Games [62.41639598376539]
This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent.
The principal can influence the agent's decisions by offering incentives that add to the agent's rewards.
We present nearly optimal learning algorithms for the principal's regret in both multi-armed and linear contextual settings.
arXiv Detail & Related papers (2024-03-06T16:00:46Z)
- Principal-Agent Reward Shaping in MDPs [50.914110302917756]
Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest.
We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players.
Our results cover trees and deterministic decision processes with a finite horizon.
arXiv Detail & Related papers (2023-12-30T18:30:44Z)
- Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents [57.627352949446625]
We consider a variant of the multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or absorb them.
We identify a class of MAB algorithms which satisfy a collection of properties and show that they lead to mechanisms that incentivize top-level performance at equilibrium.
arXiv Detail & Related papers (2023-12-13T06:54:49Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential decision-making framework where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden
Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of the principal's incentives and the agent's choices.
arXiv Detail & Related papers (2023-08-13T08:12:01Z) - Distributional Reward Estimation for Effective Multi-Agent Deep
Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z) - Information-Gathering in Latent Bandits [79.6953033727455]
We present a method for information-gathering in latent bandits.
We show that choosing the best arm given the agent's beliefs about the states can incur higher regret than arm selections that also gather information.
We also show that by choosing arms carefully, we obtain an improved estimation of the state distribution.
arXiv Detail & Related papers (2022-07-08T01:15:12Z) - A Farewell to Arms: Sequential Reward Maximization on a Budget with a
Giving Up Option [5.1629297054995265]
We consider a sequential decision-making problem where an agent can take one action at a time and each action has a temporal extent.
We introduce an upper-confidence-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound (a generic UCB sketch follows this list).
arXiv Detail & Related papers (2020-03-06T22:16:20Z)
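Several of the entries above (EgalUCB, WAIT-UCB) build on upper-confidence-bound index policies. As a reference point only, here is the standard UCB1 index rule in Python; this is textbook material, not a reconstruction of either paper's algorithm, and the arm means and horizon are made-up toy values.

```python
import numpy as np

def ucb1(pull, K, T):
    # Standard UCB1: play each arm once, then repeatedly pick the arm
    # maximizing empirical mean + sqrt(2 ln t / n_i).
    # `pull(a)` returns a noisy reward for arm a.
    counts = np.zeros(K)
    sums = np.zeros(K)
    for a in range(K):                 # initialization: one pull per arm
        sums[a] += pull(a)
        counts[a] = 1
    for t in range(K + 1, T + 1):
        index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        a = int(np.argmax(index))
        sums[a] += pull(a)
        counts[a] += 1
    return sums.sum()

# Toy usage: Bernoulli arms with means 0.2, 0.5, 0.8.
means = np.array([0.2, 0.5, 0.8])
rng = np.random.default_rng(1)
total = ucb1(lambda a: float(rng.random() < means[a]), K=3, T=2000)
print(f"average reward over 2000 rounds: {total / 2000:.3f}")
```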