Learning in Restless Bandits under Exogenous Global Markov Process
- URL: http://arxiv.org/abs/2112.09484v1
- Date: Fri, 17 Dec 2021 12:47:30 GMT
- Title: Learning in Restless Bandits under Exogenous Global Markov Process
- Authors: Tomer Gafni, Michal Yemini, Kobi Cohen
- Abstract summary: We consider an extension to the restless multi-armed bandit (RMAB) problem with unknown arm dynamics.
Under each global state, the rewards process of each arm evolves according to an unknown Markovian rule.
We develop the Learning under Exogenous Markov Process (LEMP) algorithm to minimize regret.
- Score: 13.836565669337057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider an extension to the restless multi-armed bandit (RMAB) problem
with unknown arm dynamics, where an unknown exogenous global Markov process
governs the rewards distribution of each arm. Under each global state, the
rewards process of each arm evolves according to an unknown Markovian rule,
which is non-identical among different arms. At each time, a player chooses an
arm out of $N$ arms to play, and receives a random reward from a finite set of
reward states. The arms are restless, that is, their local state evolves
regardless of the player's actions. Motivated by recent studies on related RMAB
settings, the regret is defined as the reward loss with respect to a player
that knows the dynamics of the problem, and plays at each time $t$ the arm that
maximizes the expected immediate value. The objective is to develop an
arm-selection policy that minimizes the regret. To that end, we develop the
Learning under Exogenous Markov Process (LEMP) algorithm. We analyze LEMP
theoretically and establish a finite-sample bound on the regret. We show that
LEMP achieves a logarithmic regret order with time. We further analyze LEMP
numerically and present simulation results that support the theoretical
findings and demonstrate that LEMP significantly outperforms alternative
algorithms.
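The model described above can be made concrete with a short simulation sketch. Everything below is a hypothetical toy instance, not the authors' LEMP implementation: the sizes (2 global states, 3 arms, 2 local reward states) and the names `P_global`, `P_local`, `step`, and `oracle_arm` are our own. The benchmark shown is a simplified stand-in for the genie in the regret definition, under the assumption that it can observe the current states directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small instance: 2 global states, 3 arms, 2 local reward states.
N_ARMS, N_GLOBAL, N_LOCAL = 3, 2, 2

# Exogenous global Markov chain (row-stochastic transition matrix).
P_global = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

# Per-arm local transition matrices, one per global state:
# shape (global state, arm, local state, next local state).
# Drawn randomly so the rule is non-identical among arms.
P_local = rng.dirichlet(np.ones(N_LOCAL), size=(N_GLOBAL, N_ARMS, N_LOCAL))

# Reward attached to each local (reward) state.
rewards = np.array([0.0, 1.0])

def step(global_state, local_states, arm):
    """Advance the system one time step. The arms are restless:
    every arm's local state evolves regardless of which arm is played."""
    r = rewards[local_states[arm]]
    next_locals = np.array([
        rng.choice(N_LOCAL, p=P_local[global_state, a, local_states[a]])
        for a in range(N_ARMS)
    ])
    next_global = rng.choice(N_GLOBAL, p=P_global[global_state])
    return r, next_global, next_locals

def oracle_arm(global_state, local_states):
    """Simplified benchmark: play the arm with the largest expected
    immediate reward, given full knowledge of the dynamics and states."""
    expected = [
        P_local[global_state, a, local_states[a]] @ rewards
        for a in range(N_ARMS)
    ]
    return int(np.argmax(expected))
```

Accumulating the gap between the oracle's expected reward and a learner's realized reward over a horizon of $T$ steps gives an empirical estimate of the regret that LEMP is designed to keep logarithmic in $T$.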
Related papers
- Provably Efficient Reinforcement Learning for Adversarial Restless Multi-Armed Bandits with Unknown Transitions and Bandit Feedback [11.262167746929332]
Restless multi-armed bandits (RMAB) play a central role in modeling sequential decision making problems.
In this paper, we consider the task of learning in episodic RMAB with unknown transition functions and adversarial rewards.
We develop a novel reinforcement learning algorithm with two key components.
arXiv Detail & Related papers (2024-05-02T02:20:19Z)
- Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation [50.469872635246176]
We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit.
This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards.
arXiv Detail & Related papers (2023-11-27T09:19:01Z)
- Optimal Best Arm Identification with Fixed Confidence in Restless Bandits [66.700654953613]
We study best arm identification in a restless multi-armed bandit setting with finitely many arms.
The discrete-time data generated by each arm forms a homogeneous Markov chain taking values in a common, finite state space.
It is demonstrated that tracking the long-term behavior of a certain Markov decision process and its state-action visitation proportions are the key ingredients in analyzing the converse and achievability bounds.
arXiv Detail & Related papers (2023-10-20T10:04:05Z)
- Best Arm Identification in Restless Markov Multi-Armed Bandits [85.55466536537293]
We study the problem of identifying the best arm in a multi-armed bandit environment.
A decision entity wishes to find the index of the best arm as quickly as possible, subject to an upper bound on the error probability.
We show that this policy achieves an upper bound that depends on $R$ and is monotonically non-increasing as $R \to \infty$.
arXiv Detail & Related papers (2022-03-29T04:58:04Z)
- Restless Multi-Armed Bandits under Exogenous Global Markov Process [20.58296570065978]
We consider an extension to the restless multi-armed bandit (RMAB) problem with unknown arm dynamics.
Under each global state, the rewards process of each arm evolves according to an unknown Markovian rule.
We develop the Learning under Exogenous Markov Process (LEMP) algorithm, that achieves a logarithmic regret order with time.
arXiv Detail & Related papers (2022-02-28T10:29:42Z)
- Best Arm Identification under Additive Transfer Bandits [49.69203462561861]
We study the setting where, despite the means being unknown, there is a known additive relationship between the source and target MAB instances.
We propose and theoretically analyze an LUCB-style algorithm to identify an $\epsilon$-optimal target arm with high probability.
arXiv Detail & Related papers (2021-12-08T02:20:18Z)
- Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2 \sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z)
- Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits [61.490254407420906]
We study the online restless bandit problem, where the state of each arm evolves according to a Markov chain.
We propose Restless-UCB, a learning policy that follows the explore-then-commit framework.
arXiv Detail & Related papers (2020-11-05T05:16:04Z)
- Detecting an Odd Restless Markov Arm with a Trembling Hand [18.122816058828906]
We consider a multi-armed bandit in which each arm is a Markov process evolving on a finite state space.
The transition probability matrix of one of the arms (the odd arm) is different from the common transition probability matrix of all other arms.
A decision maker wishes to identify the odd arm as quickly as possible, while keeping the probability of decision error small.
arXiv Detail & Related papers (2020-05-13T11:27:14Z)
- Ballooning Multi-Armed Bandits [12.205797997133397]
We introduce Ballooning Multi-Armed Bandits (BL-MAB).
In BL-MAB, the set of available arms grows (or balloons) over time.
We show that if the best arm is equally likely to arrive at any time instant, a sublinear regret cannot be achieved.
arXiv Detail & Related papers (2020-01-24T04:35:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.