Player-optimal Stable Regret for Bandit Learning in Matching Markets
- URL: http://arxiv.org/abs/2307.10890v1
- Date: Thu, 20 Jul 2023 14:10:33 GMT
- Title: Player-optimal Stable Regret for Bandit Learning in Matching Markets
- Authors: Fang Kong, Shuai Li
- Abstract summary: We show that the optimal stable regret of each player can be upper bounded by $O(K\log T/\Delta^2)$ where $K$ is the number of arms, $T$ is the horizon and $\Delta$ is the players' minimum preference gap among the first $N+1$-ranked arms.
Our work significantly improves previous works which either have a weaker player-pessimal stable matching objective or apply only to markets with special assumptions.
- Score: 12.54215178882448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The problem of matching markets has been studied for a long time in the
literature due to its wide range of applications. Finding a stable matching is
a common equilibrium objective in this problem. Since market participants are
usually uncertain of their preferences, a rich line of recent works study the
online setting where participants on one side (players) learn their unknown
preferences from iterative interactions with the other side (arms). Most
previous works in this line are only able to derive theoretical guarantees for
player-pessimal stable regret, which is defined compared with the players'
least-preferred stable matching. However, under the pessimal stable matching,
players only obtain the least reward among all stable matchings. To maximize
players' profits, player-optimal stable matching would be the most desirable.
Though \citet{basu21beyond} successfully derive an upper bound for
player-optimal stable regret, their result can be exponentially large if
players' preference gap is small. Whether a polynomial guarantee for this
regret exists is a significant but still open problem. In this work, we provide
a new algorithm named explore-then-Gale-Shapley (ETGS) and show that the
optimal stable regret of each player can be upper bounded by $O(K\log
T/\Delta^2)$ where $K$ is the number of arms, $T$ is the horizon and $\Delta$
is the players' minimum preference gap among the first $N+1$-ranked arms. This
result significantly improves previous works which either have a weaker
player-pessimal stable matching objective or apply only to markets with special
assumptions. When the preferences of participants satisfy some special
conditions, our regret upper bound also matches the previously derived lower
bound.
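The abstract describes ETGS only at a high level. The following is a minimal, hypothetical Python sketch of the explore-then-Gale-Shapley idea: a uniform exploration phase in which each player estimates its preference order from noisy rewards, followed by player-proposing Gale-Shapley (deferred acceptance) on the estimated preferences. All names, the noise model, the fixed exploration budget, and the centralized bookkeeping are assumptions made for illustration; the paper's actual ETGS algorithm and its exploration scheme may differ.

```python
# Hypothetical sketch of an explore-then-Gale-Shapley style procedure.
# Assumes N players and K >= N arms; arms' own preferences are known here,
# while players must estimate theirs from noisy reward feedback.
import random
from collections import defaultdict

def explore_then_gale_shapley(true_means, arm_prefs, explore_rounds=200, seed=0):
    """true_means[p][a]: mean reward of arm a for player p (unknown to players).
    arm_prefs[a]: players ordered from most to least preferred by arm a."""
    rng = random.Random(seed)
    n_players, n_arms = len(true_means), len(true_means[0])

    # Phase 1: uniform exploration to estimate each player's preference order.
    sums, counts = defaultdict(float), defaultdict(int)
    for _ in range(explore_rounds):
        for p in range(n_players):
            a = rng.randrange(n_arms)
            sums[(p, a)] += true_means[p][a] + rng.gauss(0.0, 0.1)  # noisy reward
            counts[(p, a)] += 1
    est = [[sums[(p, a)] / max(counts[(p, a)], 1) for a in range(n_arms)]
           for p in range(n_players)]
    player_prefs = [sorted(range(n_arms), key=lambda a: -est[p][a])
                    for p in range(n_players)]

    # Phase 2: player-proposing Gale-Shapley (deferred acceptance), which yields
    # the player-optimal stable matching for the estimated preferences.
    arm_rank = [{p: r for r, p in enumerate(arm_prefs[a])} for a in range(n_arms)]
    next_choice = [0] * n_players      # next index to propose to in each player's list
    holder = {}                        # arm -> player currently holding it
    free = list(range(n_players))
    while free:
        p = free.pop()
        a = player_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if a not in holder:
            holder[a] = p
        elif arm_rank[a][p] < arm_rank[a][holder[a]]:
            free.append(holder[a])     # arm prefers p; previous holder is rejected
            holder[a] = p
        else:
            free.append(p)             # proposal rejected; p will try its next arm
    return {p: a for a, p in holder.items()}


# Toy usage: 2 players, 3 arms.
matching = explore_then_gale_shapley(
    true_means=[[0.9, 0.5, 0.1], [0.8, 0.6, 0.2]],
    arm_prefs=[[1, 0], [0, 1], [0, 1]],
)
print(matching)  # player -> arm assignment
```

In the toy usage, both players initially favor arm 0; deferred acceptance resolves the conflict in favor of the player that arm 0 prefers, and the other player settles for its next estimated choice.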
Related papers
- Improved Bandits in Many-to-one Matching Markets with Incentive Compatibility [12.05174711411919]
Two-sided matching markets have been widely studied in the literature due to their rich applications.
We first extend an existing algorithm for the one-to-one setting to this more general setting and show it achieves a near-optimal bound for player-optimal regret.
We propose the adaptively explore-then-deferred (AETDA) algorithm for the responsiveness setting and derive an upper bound for player-optimal stable regret.
arXiv Detail & Related papers (2024-01-03T03:45:35Z)
- Competing Bandits in Time Varying Matching Markets [1.1470070927586014]
We study the problem of online learning in two-sided non-stationary matching markets, where the objective is to converge to a stable match.
We propose the Restart Competing Bandits (RCB) algorithm, which combines competing bandits with a simple restart strategy to handle the non-stationarity.
We show that, with the proposed algorithm, each player receives a uniform sub-linear regret of $\widetilde{\mathcal{O}}(L_T^{1/2}T^{1/2})$, where $L_T$ is the number of changes in the underlying preferences of agents.
arXiv Detail & Related papers (2022-10-21T02:36:57Z)
- Learning Correlated Equilibria in Mean-Field Games [62.14589406821103]
We develop the concepts of Mean-Field correlated and coarse-correlated equilibria.
We show that they can be efficiently learnt in all games, without requiring any additional assumption on the structure of the game.
arXiv Detail & Related papers (2022-08-22T08:31:46Z)
- Can Reinforcement Learning Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopic Followers? [156.5760265539888]
We study multi-player general-sum Markov games with one of the players designated as the leader and the other players regarded as followers.
For such a game, our goal is to find a Stackelberg-Nash equilibrium (SNE), which is a policy pair $(\pi^*, \nu^*)$.
We develop sample-efficient reinforcement learning (RL) algorithms for solving for an SNE in both online and offline settings.
arXiv Detail & Related papers (2021-12-27T05:41:14Z)
- Learning Equilibria in Matching Markets from Bandit Feedback [139.29934476625488]
We develop a framework and algorithms for learning stable market outcomes under uncertainty.
Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.
arXiv Detail & Related papers (2021-08-19T17:59:28Z)
- Bandit Learning in Decentralized Matching Markets [82.39061186055775]
We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience.
This model extends the standard multi-armed bandit framework to a decentralized multiple player setting with competition.
We show that the algorithm is incentive compatible whenever the arms' preferences are shared, but not necessarily so when preferences are fully general.
arXiv Detail & Related papers (2020-12-14T08:58:07Z)
- Multitask Bandit Learning Through Heterogeneous Feedback Aggregation [35.923544685900055]
We formulate the problem as the $\epsilon$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms.
We develop an upper confidence bound-based algorithm, RobustAgg$(\epsilon)$, that adaptively aggregates rewards collected by different players.
arXiv Detail & Related papers (2020-10-29T07:13:28Z)
- Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z)
- Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap.
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
arXiv Detail & Related papers (2020-02-17T17:04:16Z)
- Selfish Robustness and Equilibria in Multi-Player Bandits [25.67398941667429]
In a game, several players simultaneously pull arms and encounter a collision - with 0 reward - if some of them pull the same arm at the same time.
While the cooperative case where players maximize the collective reward has been mostly considered, robustness to malicious players is a crucial and challenging concern.
We shall consider instead the more natural class of selfish players whose incentives are to maximize their individual rewards, potentially at the expense of the social welfare.
arXiv Detail & Related papers (2020-02-04T09:50:28Z)