Multi-Armed Bandits with Dependent Arms
- URL: http://arxiv.org/abs/2010.09478v2
- Date: Fri, 23 Oct 2020 21:06:01 GMT
- Title: Multi-Armed Bandits with Dependent Arms
- Authors: Rahul Singh, Fang Liu, Yin Sun and Ness Shroff
- Abstract summary: We study a variant of the classical multi-armed bandit problem (MABP) that we call Multi-Armed Bandits with Dependent Arms.
Multiple arms are grouped together to form a cluster, and the reward distributions of arms belonging to the same cluster are known functions of an unknown parameter that is a characteristic of the cluster.
We develop learning algorithms based on the UCB principle that utilize these additional side observations appropriately while managing the exploration-exploitation trade-off.
- Score: 18.81667618369821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a variant of the classical multi-armed bandit problem (MABP) which
we call Multi-Armed Bandits with Dependent Arms. More specifically, multiple
arms are grouped together to form a cluster, and the reward distributions of
arms belonging to the same cluster are known functions of an unknown parameter
that is a characteristic of the cluster. Thus, pulling an arm $i$ not only
reveals information about its own reward distribution, but also about all those
arms that share the same cluster with arm $i$. This "correlation" amongst the
arms complicates the exploration-exploitation trade-off that is encountered in
the MABP, because the observation dependencies allow us to simultaneously test
multiple hypotheses regarding the optimality of an arm. We develop learning
algorithms based on the UCB principle that utilize these additional side
observations appropriately while managing the exploration-exploitation trade-off.
We show that the regret of our algorithms grows as $O(K\log T)$, where $K$ is
the number of clusters. In contrast, for an algorithm such as the vanilla UCB
that is optimal for the classical MABP and does not utilize these dependencies,
the regret scales as $O(M\log T)$ where $M$ is the number of arms.
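To make the pooled exploration concrete, here is a minimal Python sketch of a cluster-level UCB index. It assumes, purely for illustration, that rewards are 1-subgaussian and that the mean of arm $i$ in cluster $c$ is a known linear map $a_i \theta_c$ with $a_i > 0$, so every pull of any arm in a cluster refines a shared estimate of that cluster's parameter; the linear means and the exact confidence width are assumptions of this sketch, not the paper's construction.

```python
import math

class ClusterUCB:
    """Hedged sketch of a UCB that pools observations within clusters.

    Illustrative assumptions (not from the paper): rewards are 1-subgaussian
    and arm i's mean is a[i] * theta[c(i)] with known a[i] > 0, so
    reward / a[i] is an unbiased sample of the cluster parameter.
    """

    def __init__(self, a, cluster_of):
        self.a = a                        # known per-arm scale factors
        self.cluster_of = cluster_of      # arm index -> cluster index
        k = max(cluster_of) + 1
        self.n = [0] * k                  # pooled sample count per cluster
        self.theta_hat = [0.0] * k        # running estimate of theta[c]
        self.t = 0

    def select_arm(self):
        self.t += 1
        best_arm, best_idx = 0, -math.inf
        for i, ai in enumerate(self.a):
            c = self.cluster_of[i]
            if self.n[c] == 0:
                return i                  # sample each cluster at least once
            width = math.sqrt(2.0 * math.log(self.t) / self.n[c])
            idx = ai * (self.theta_hat[c] + width)  # optimistic mean of arm i
            if idx > best_idx:
                best_arm, best_idx = i, idx
        return best_arm

    def update(self, arm, reward):
        c = self.cluster_of[arm]
        self.n[c] += 1
        # a pull of any arm in cluster c shrinks the same confidence interval
        self.theta_hat[c] += (reward / self.a[arm] - self.theta_hat[c]) / self.n[c]
```

Because the confidence width shrinks with the pooled cluster count rather than with per-arm counts, the $\log T$ exploration cost is paid once per cluster, which is the intuition behind the $O(K\log T)$ versus $O(M\log T)$ contrast above.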
Related papers
- A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
arXiv Detail & Related papers (2024-09-08T12:19:12Z)
- Representative Arm Identification: A fixed confidence approach to identify cluster representatives [7.459521930846415]
We study the representative arm identification problem in the multi-armed bandits (MAB) framework.
The RAI problem covers as special cases several well-studied MAB problems such as identifying the best arm or any $M$ out of the top $K$.
We propose two algorithms, based on the idea of confidence intervals, and provide high probability upper bounds on their sample complexity.
arXiv Detail & Related papers (2024-08-26T11:47:52Z)
- Complexity Analysis of a Countable-armed Bandit Problem [9.163501953373068]
We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$.
We propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left(\log n\right)$ when $K=2$.
While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter.
arXiv Detail & Related papers (2023-01-18T00:53:46Z)
- Max-Quantile Grouped Infinite-Arm Bandits [39.7239093424124]
We consider a bandit problem in which there are a number of groups, each consisting of infinitely many arms.
We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm.
We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter.
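The two-step structure lends itself to a short skeleton. In the hedged sketch below, the group interface (`draw_arm`) and the finite-arm subroutine passed in are placeholder names for illustration, not the paper's definitions:

```python
def two_step_max_quantile(groups, arms_per_group, finite_arm_solver):
    # Step 1: request a fixed number of arms from each infinite-arm group.
    sampled = [[g.draw_arm() for _ in range(arms_per_group)] for g in groups]
    # Step 2: hand the now-finite instance to any grouped max-quantile
    # bandit algorithm, which returns the group with the best max-quantile.
    return finite_arm_solver(sampled)
```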
arXiv Detail & Related papers (2022-10-04T00:46:49Z)
- Almost Cost-Free Communication in Federated Best Arm Identification [76.12303738941254]
We study the problem of best arm identification in a federated learning multi-armed bandit setup with a central server and multiple clients.
We propose a novel algorithm, FedElim, that is based on successive elimination and communicates only at exponentially spaced time steps.
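A minimal sketch of why such a schedule is almost cost-free, assuming a doubling schedule (an illustrative choice, not necessarily the paper's exact one): over a horizon $T$, clients sync with the server only $O(\log T)$ times.

```python
def communication_rounds(horizon):
    """Yield exponentially spaced rounds at which clients sync with the server.

    Illustrative doubling schedule (1, 2, 4, 8, ...); over a horizon T this
    produces only O(log T) communication rounds.
    """
    t = 1
    while t <= horizon:
        yield t
        t *= 2

# e.g. list(communication_rounds(1000)) -> [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
```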
arXiv Detail & Related papers (2022-08-19T08:37:09Z)
- Best Arm Identification in Restless Markov Multi-Armed Bandits [85.55466536537293]
We study the problem of identifying the best arm in a multi-armed bandit environment.
A decision entity wishes to find the index of the best arm as quickly as possible, subject to an upper bound error probability.
We show that this policy achieves an upper bound that depends on $R$ and is monotonically non-increasing as $R \to \infty$.
arXiv Detail & Related papers (2022-03-29T04:58:04Z)
- Best Arm Identification under Additive Transfer Bandits [49.69203462561861]
We study the setting when, despite the means being unknown, there is a known additive relationship between the source and target MAB instances.
We propose and theoretically analyze an LUCB-style algorithm to identify an $\epsilon$-optimal target arm with high probability.
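A hedged sketch of an LUCB-style stopping rule in this setup: because the additive source-to-target offsets are known, source samples can be re-centred and pooled with target samples before computing confidence bounds. The pooling convention, function name, and confidence radius below are illustrative choices, not the paper's exact algorithm.

```python
import math

def lucb_should_stop(means, counts, delta, eps):
    """Illustrative LUCB-style eps-optimal stopping check.

    means/counts are per-arm empirical means and sample counts, assumed to
    already pool re-centred source samples (the additive offsets are known)
    with target samples. The radius is a simplified Hoeffding-style choice.
    """
    rad = [math.sqrt(math.log(1.0 / delta) / (2.0 * max(1, n))) for n in counts]
    h = max(range(len(means)), key=lambda i: means[i])       # empirical best
    lcb_h = means[h] - rad[h]
    best_challenger_ucb = max(
        means[i] + rad[i] for i in range(len(means)) if i != h
    )
    # Stop once the empirical best arm is eps-close to beating every challenger.
    return lcb_h + eps >= best_challenger_ucb
```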
arXiv Detail & Related papers (2021-12-08T02:20:18Z)
- Best-Arm Identification in Correlated Multi-Armed Bandits [9.180690233707645]
We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms.
We show that the total samples obtained by C-LUCB are of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$ as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{A}} \log\left(\frac{1}{\delta}\right)\right)$ samples required in the independent reward setting, where $\mathcal{C} \subseteq \mathcal{A}$ is the set of competitive arms among the set $\mathcal{A}$ of all arms.
arXiv Detail & Related papers (2021-09-10T15:41:07Z)
- Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z)
- Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
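For reference, here is the standard single-action form of Inverse Gap Weighting (the paper's top-$k$, extreme-scale variant is more involved); `gamma` is the exploration parameter and `preds` are the regression oracle's predicted rewards, both illustrative names:

```python
import random

def igw_sample(preds, gamma):
    """Standard Inverse Gap Weighting over a small action set.

    Illustrative single-action IGW (not the paper's top-k extreme variant):
    each non-greedy action a gets probability 1 / (A + gamma * gap(a)), and
    the greedy action receives the remaining probability mass.
    """
    A = len(preds)
    greedy = max(range(A), key=lambda a: preds[a])
    p = [0.0] * A
    for a in range(A):
        if a != greedy:
            p[a] = 1.0 / (A + gamma * (preds[greedy] - preds[a]))
    p[greedy] = 1.0 - sum(p)    # gaps are nonnegative, so this mass is positive
    return random.choices(range(A), weights=p)[0]
```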
arXiv Detail & Related papers (2021-02-15T19:10:52Z)