Best Arm Identification in Restless Markov Multi-Armed Bandits
- URL: http://arxiv.org/abs/2203.15236v1
- Date: Tue, 29 Mar 2022 04:58:04 GMT
- Title: Best Arm Identification in Restless Markov Multi-Armed Bandits
- Authors: P. N. Karthik, Kota Srinivas Reddy, Vincent Y. F. Tan
- Abstract summary: We study the problem of identifying the best arm in a multi-armed bandit environment.
A decision entity wishes to find the index of the best arm as quickly as possible, subject to an upper bound error probability.
We show that this policy achieves an upper bound that depends on $R$ and is monotonically non-increasing as $R\to\infty$.
- Score: 85.55466536537293
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of identifying the best arm in a multi-armed bandit
environment when each arm is a time-homogeneous and ergodic discrete-time
Markov process on a common, finite state space. The state evolution on each arm
is governed by the arm's transition probability matrix (TPM). A decision entity
that knows the set of arm TPMs but not the exact mapping of the TPMs to the
arms, wishes to find the index of the best arm as quickly as possible, subject
to an upper bound on the error probability. The decision entity selects one arm
at a time sequentially, and all the unselected arms continue to undergo state
evolution (restless arms). For this problem, we derive the first-known
problem instance-dependent asymptotic lower bound on the growth rate of the
expected time required to find the index of the best arm, where the asymptotics
is as the error probability vanishes. Further, we propose a sequential policy
that, for an input parameter $R$, forcibly selects an arm that has not been
selected for $R$ consecutive time instants. We show that this policy achieves
an upper bound that depends on $R$ and is monotonically non-increasing as
$R\to\infty$. The question of whether, in general, the limiting value of the
upper bound as $R\to\infty$ matches with the lower bound, remains open. We
identify a special case in which the upper and the lower bounds match. Prior
works on best arm identification have dealt with (a) independent and
identically distributed observations from the arms, and (b) rested Markov arms,
whereas our work deals with the more difficult setting of restless Markov arms.
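The restless setting described above can be made concrete with a small simulation. The following is an illustrative sketch only, not the paper's policy: the arm count, the two-state TPMs, and the value of R are hypothetical choices, and the selection rule is a placeholder. It shows the two defining features of the setting: every arm's state evolves at every time step whether or not it is selected, and an arm that has gone unselected for R consecutive instants is forcibly selected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 arms, 2 states, known set of TPMs (values invented
# for illustration). Only the selected arm's state is observed.
TPMS = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),   # arm that lingers in state 0
    np.array([[0.5, 0.5], [0.5, 0.5]]),
    np.array([[0.3, 0.7], [0.6, 0.4]]),
]
K, R, T = len(TPMS), 4, 30   # R: forced-selection gap (hypothetical value)

states = [0] * K                     # hidden current state of each arm
last_selected = [0] * K              # time of each arm's last selection
observations = [[] for _ in range(K)]

for t in range(1, T + 1):
    # Restless arms: every arm's state evolves at every time step.
    for a in range(K):
        states[a] = rng.choice(2, p=TPMS[a][states[a]])
    # Forced selection: any arm idle for R instants must be selected;
    # otherwise fall back to a placeholder rule (always arm 0).
    overdue = [a for a in range(K) if t - last_selected[a] >= R]
    arm = overdue[0] if overdue else 0
    last_selected[arm] = t
    observations[arm].append(states[arm])   # only this arm is observed

print([len(o) for o in observations])       # selections per arm
```

With the placeholder rule, arms 1 and 2 are sampled only through forced selections, so each is visited once every R steps; a real policy would instead choose arms to discriminate between the candidate TPM-to-arm mappings.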
Related papers
- Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting [67.1631453378926]
Graph-Triggered Bandits is a framework to generalize rested and restless bandits.
In this work, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs.
arXiv Detail & Related papers (2024-09-09T18:23:07Z) - A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
arXiv Detail & Related papers (2024-09-08T12:19:12Z) - Optimal Best Arm Identification with Fixed Confidence in Restless Bandits [66.700654953613]
We study best arm identification in a restless multi-armed bandit setting with finitely many arms.
The discrete-time data generated by each arm forms a homogeneous Markov chain taking values in a common, finite state space.
It is demonstrated that tracking the long-term behavior of a certain Markov decision process and its state-action visitation proportions is the key ingredient in analyzing the converse and achievability bounds.
arXiv Detail & Related papers (2023-10-20T10:04:05Z) - Federated Best Arm Identification with Heterogeneous Clients [62.36929749450298]
We study best arm identification in a federated multi-armed bandit setting with a central server and multiple clients.
We consider any algorithm whose upper bound on the expected stopping time matches the lower bound up to a multiplicative constant (an almost-optimal algorithm).
We propose a novel algorithm that communicates only at exponentially spaced time instants, and demonstrate that it is almost-optimal.
arXiv Detail & Related papers (2022-10-14T13:09:11Z) - Max-Quantile Grouped Infinite-Arm Bandits [39.7239093424124]
We study a grouped bandit problem in which there are a number of groups, each consisting of infinitely many arms.
We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm.
We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter.
arXiv Detail & Related papers (2022-10-04T00:46:49Z) - Almost Cost-Free Communication in Federated Best Arm Identification [76.12303738941254]
We study the problem of best arm identification in a federated learning multi-armed bandit setup with a central server and multiple clients.
We propose a novel algorithm, FedElim, that is based on successive elimination and communicates only at exponentially spaced time steps.
arXiv Detail & Related papers (2022-08-19T08:37:09Z) - Learning to Detect an Odd Restless Markov Arm with a Trembling Hand [12.467685221424032]
The anomaly means that the transition probability matrix (TPM) of one of the arms is different from the common TPM of each of the non-odd arms.
We devise a policy based on the principle of certainty equivalence, and demonstrate that under a continuous selection assumption and a certain regularity assumption on the TPMs, the policy achieves the lower bound arbitrarily closely.
Our achievability analysis is based on resolving the identifiability problem in the context of a certain countable-state controlled Markov process.
arXiv Detail & Related papers (2021-05-08T05:53:12Z) - Detecting an Odd Restless Markov Arm with a Trembling Hand [18.122816058828906]
We consider a multi-armed bandit in which each arm is a Markov process evolving on a finite state space.
The transition probability matrix of one of the arms (the odd arm) is different from the common transition probability matrix of all other arms.
A decision maker wishes to identify the odd arm as quickly as possible, while keeping the probability of decision error small.
arXiv Detail & Related papers (2020-05-13T11:27:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.