Competing Bandits in Time Varying Matching Markets
- URL: http://arxiv.org/abs/2210.11692v1
- Date: Fri, 21 Oct 2022 02:36:57 GMT
- Title: Competing Bandits in Time Varying Matching Markets
- Authors: Deepan Muthirayan, Chinmay Maheshwari, Pramod P. Khargonekar, Shankar Sastry
- Abstract summary: We study the problem of online learning in two-sided non-stationary matching markets, where the objective is to converge to a stable match.
We propose the Restart Competing Bandits (RCB) algorithm, which combines a simple restart strategy, to handle the non-stationarity, with the competing bandits algorithm designed for the stationary case.
We show that, with the proposed algorithm, each player receives a uniform sub-linear regret of $\widetilde{\mathcal{O}}(L_T^{1/2}T^{1/2})$, where $L_T$ is the number of changes in the underlying preferences of the agents.
- Score: 1.1470070927586014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of online learning in two-sided non-stationary matching
markets, where the objective is to converge to a stable match. In particular,
we consider the setting where one side of the market, the arms, has fixed known
set of preferences over the other side, the players. While this problem has
been studied when the players have fixed but unknown preferences, in this work
we study the problem of how to learn when the preferences of the players are
time varying. We propose the {\it Restart Competing Bandits (RCB)} algorithm,
which combines a simple {\it restart strategy} to handle the non-stationarity
with the {\it competing bandits} algorithm \citep{liu2020competing} designed
for the stationary case. We show that, with the proposed algorithm, each player
receives a uniform sub-linear regret of
$\widetilde{\mathcal{O}}(L_T^{1/2}T^{1/2})$ up to the number of changes in
the underlying preference of agents, $L_T$. We also discuss extensions of this
algorithm to the case where the number of changes need not be known a priori.
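To make the restart idea concrete, below is a minimal, illustrative sketch of how a restart schedule could wrap a UCB-style competing-bandits round in this setting: each player proposes to the arm with the highest UCB index, each arm accepts the proposer it ranks highest under its fixed, known preferences, and all UCB statistics are discarded every `epoch_len` rounds. This is not the paper's implementation; the function name, the Gaussian reward noise, and the particular confidence bonus are assumptions made for the example, and in the paper the restart schedule would be tuned using $L_T$ and $T$ (or chosen adaptively when $L_T$ is unknown).

```python
import numpy as np

def rcb_sketch(true_means, arm_prefs, horizon, epoch_len, rng=None):
    """Toy restart-based competing-bandits simulation (illustrative only).

    true_means : (num_players, num_arms) array of player reward means; callers
                 can change it over time to model shifting player preferences.
    arm_prefs  : (num_arms, num_players) array; arm_prefs[a] lists players from
                 most to least preferred (fixed and known, as in the paper).
    epoch_len  : number of rounds between restarts of the UCB statistics.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_players, n_arms = true_means.shape
    total_reward = np.zeros(n_players)

    for t in range(horizon):
        if t % epoch_len == 0:
            # Restart: forget all reward estimates to track the changed market.
            counts = np.zeros((n_players, n_arms))
            sums = np.zeros((n_players, n_arms))

        # Each player proposes to the arm with the highest UCB index.
        means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        bonus = np.where(counts > 0,
                         np.sqrt(2.0 * np.log(t + 1.0) / np.maximum(counts, 1)),
                         np.inf)
        proposals = np.argmax(means + bonus, axis=1)

        # Each arm accepts the proposer it ranks highest; the rest are rejected.
        for a in range(n_arms):
            proposers = np.flatnonzero(proposals == a)
            if proposers.size == 0:
                continue
            rank = {p: r for r, p in enumerate(arm_prefs[a])}
            winner = min(proposers, key=rank.__getitem__)
            reward = rng.normal(true_means[winner, a], 0.1)
            counts[winner, a] += 1.0
            sums[winner, a] += reward
            total_reward[winner] += reward

    return total_reward
```

A natural design point in this sketch is that only the players' reward statistics are reset at a restart; the arms' preference lists stay fixed throughout, matching the setting described in the abstract.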
Related papers
- Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities [69.34646544774161]
We formulate a new variant of the multi-player multi-armed bandit (MAB) model, which captures the arrival of requests to each arm and the policy of allocating requests to players.
The challenge is how to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile.
We design an iterative distributed algorithm, which guarantees that players can arrive at a consensus on the optimal arm pulling profile in only M rounds.
arXiv Detail & Related papers (2024-08-20T13:57:00Z) - Player-optimal Stable Regret for Bandit Learning in Matching Markets [12.54215178882448]
We show that the optimal stable regret of each player can be upper bounded by $O(K\log T/\Delta^2)$ where $K$ is the number of arms, $T$ is the horizon and $\Delta$ is the players' minimum preference gap among the first $N+1$ ranked arms.
Our work significantly improves previous works which either have a weaker player-pessimal stable matching objective or apply only to markets with special assumptions.
arXiv Detail & Related papers (2023-07-20T14:10:33Z) - Hardness of Independent Learning and Sparse Equilibrium Computation in
Markov Games [70.19141208203227]
We consider the problem of decentralized multi-agent reinforcement learning in Markov games.
We show that no algorithm attains no-regret in general-sum games when executed independently by all players.
We show that our lower bounds hold even for the seemingly easier setting in which all agents are controlled by a centralized algorithm.
arXiv Detail & Related papers (2023-03-22T03:28:12Z) - Provably Efficient Fictitious Play Policy Optimization for Zero-Sum
Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z) - Decentralized Competing Bandits in Non-Stationary Matching Markets [46.13741000158631]
We introduce the framework of decentralized two-sided matching markets under non-stationary (dynamic) environments.
We propose and analyze a decentralized and asynchronous learning algorithm, namely Decentralized Non-stationary Competing Bandits (DNCB).
We characterize this forced exploration and obtain sub-linear (logarithmic) regret of DNCB.
arXiv Detail & Related papers (2022-05-31T21:05:30Z) - Learning Equilibria in Matching Markets from Bandit Feedback [139.29934476625488]
We develop a framework and algorithms for learning stable market outcomes under uncertainty.
Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.
arXiv Detail & Related papers (2021-08-19T17:59:28Z) - Bandit Learning in Decentralized Matching Markets [82.39061186055775]
We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience.
This model extends the standard multi-armed bandit framework to a decentralized multiple player setting with competition.
We show that the algorithm is incentive compatible whenever the arms' preferences are shared, but not necessarily so when preferences are fully general.
arXiv Detail & Related papers (2020-12-14T08:58:07Z) - Multitask Bandit Learning Through Heterogeneous Feedback Aggregation [35.923544685900055]
We formulate the problem as the $\epsilon$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms.
We develop an upper confidence bound-based algorithm, RobustAgg$(\epsilon)$, that adaptively aggregates rewards collected by different players.
arXiv Detail & Related papers (2020-10-29T07:13:28Z) - Contextual Blocking Bandits [35.235375147227124]
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards.
Playing an arm blocks it (across all contexts) for a fixed and known number of future time steps.
We propose a UCB-based variant of the full-information algorithm that guarantees an $\mathcal{O}(\log T)$-regret w.r.t. an $\alpha$-regret strategy in $T$ time steps, matching the $\Omega(\log T)$ lower bound.
arXiv Detail & Related papers (2020-03-06T20:34:42Z) - Learning Zero-Sum Simultaneous-Move Markov Games Using Function
Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap.
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
arXiv Detail & Related papers (2020-02-17T17:04:16Z)