Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits
with Strategic Agents
- URL: http://arxiv.org/abs/2312.07929v1
- Date: Wed, 13 Dec 2023 06:54:49 GMT
- Title: Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits
with Strategic Agents
- Authors: Seyed A. Esmaeili, Suho Shin, Aleksandrs Slivkins
- Abstract summary: We consider a variant of the multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or absorb them.
We identify a class of MAB algorithms which satisfy a collection of properties and show that they lead to mechanisms that incentivize top level performance at equilibrium.
- Score: 57.627352949446625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a variant of the stochastic multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or
absorb them. The utility of an agent increases if she is pulled more or absorbs
more of her rewards but decreases if she spends more effort improving her
rewards. Agents have heterogeneous properties, specifically having different
means and able to improve their rewards up to different levels. Further, a
non-empty subset of agents are ''honest'' and in the worst case always give
their rewards without absorbing any part. The principal wishes to obtain a high
revenue (cumulative reward) by designing a mechanism that incentives top level
performance at equilibrium. At the same time, the principal wishes to be robust
and obtain revenue at least at the level of the honest agent with the highest
mean in case of non-equilibrium behaviour. We identify a class of MAB
algorithms which we call performance incentivizing which satisfy a collection
of properties and show that they lead to mechanisms that incentivize top level
performance at equilibrium and are robust under any strategy profile.
Interestingly, we show that UCB is an example of such a MAB algorithm. Further,
in the case where the top performance level is unknown we show that combining
second price auction ideas with performance incentivizing algorithms achieves
performance at least at the second top level while also being robust.
Related papers
- Principal-Agent Reward Shaping in MDPs [50.914110302917756]
Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest.
We study a two-player Stack game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players.
Our results establish trees and deterministic decision processes with a finite horizon.
arXiv Detail & Related papers (2023-12-30T18:30:44Z) - Bandits Meet Mechanism Design to Combat Clickbait in Online
Recommendation [50.469872635246176]
We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit.
This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards.
arXiv Detail & Related papers (2023-11-27T09:19:01Z) - Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden
Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of principal's incentives and agent's choices.
arXiv Detail & Related papers (2023-08-13T08:12:01Z) - Competing for Shareable Arms in Multi-Player Multi-Armed Bandits [29.08799537067425]
We study a novel multi-player multi-armed bandit (MPMAB) setting where players are selfish and aim to maximize their own rewards.
We propose a novel Selfish MPMAB with Averaging Allocation (SMAA) approach based on the equilibrium.
We establish that no single selfish player can significantly increase their rewards through deviation, nor can they detrimentally affect other players' rewards without incurring substantial losses for themselves.
arXiv Detail & Related papers (2023-05-30T15:59:56Z) - Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards
and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z) - Distributional Reward Estimation for Effective Multi-Agent Deep
Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL)
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z) - Learning in Stackelberg Games with Non-myopic Agents [60.927889817803745]
We study Stackelberg games where a principal repeatedly interacts with a non-myopic long-lived agent, without knowing the agent's payoff function.
We provide a general framework that reduces learning in presence of non-myopic agents to robust bandit optimization in the presence of myopic agents.
arXiv Detail & Related papers (2022-08-19T15:49:30Z) - The Effects of Reward Misspecification: Mapping and Mitigating
Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z) - Modelling Cournot Games as Multi-agent Multi-armed Bandits [4.751331778201811]
We investigate the use of a multi-agent multi-armed bandit (MA-MAB) setting for modeling repeated Cournot oligopoly games.
We find that an $epsilon$-greedy approach offers a more viable learning mechanism than other traditional MAB approaches.
We propose two novel approaches that take advantage of the ordered action space: $epsilon$-greedy+HL and $epsilon$-greedy+EL.
arXiv Detail & Related papers (2022-01-01T22:02:47Z) - Fairness in Ranking under Uncertainty [42.51950847766776]
Unfairness occurs when an agent with higher merit obtains a worse outcome than an agent with lower merit.
Our central point is that a primary cause of unfairness is uncertainty.
We show how to compute rankings that optimally trade off approximate fairness against utility to the principal.
arXiv Detail & Related papers (2021-07-14T14:10:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.