Revisiting Simple Regret Minimization in Multi-Armed Bandits
- URL: http://arxiv.org/abs/2210.16913v1
- Date: Sun, 30 Oct 2022 18:31:03 GMT
- Title: Revisiting Simple Regret Minimization in Multi-Armed Bandits
- Authors: Yao Zhao, Connor Stephens, Csaba Szepesv\'ari, Kwang-Sung Jun
- Abstract summary: Simple regret is less popular than the probability of missing the best arm or an $\epsilon$-good arm.
In this paper, we achieve improved simple regret upper bounds for both the data-rich ($T \ge n$) and data-poor ($T \le n$) regimes.
For the more challenging data-poor regime, we propose Bracketing SH (BSH), which enjoys the same improvement even without sampling each arm at least once.
- Score: 33.88679679593314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simple regret is a natural and parameter-free performance criterion for
identifying a good arm in multi-armed bandits yet is less popular than the
probability of missing the best arm or an $\epsilon$-good arm, perhaps due to
lack of easy ways to characterize it. In this paper, we achieve improved simple
regret upper bounds for both data-rich ($T\ge n$) and data-poor regime ($T \le
n$) where $n$ is the number of arms and $T$ is the number of samples. At its
heart is an improved analysis of the well-known Sequential Halving (SH)
algorithm that bounds the probability of returning an arm whose mean reward is
not within $\epsilon$ from the best (i.e., not $\epsilon$-good) for any choice
of $\epsilon>0$, although $\epsilon$ is not an input to SH. We show that this
directly implies an optimal simple regret bound of $\mathcal{O}(\sqrt{n/T})$.
Furthermore, our upper bound gets smaller as a function of the number of
$\epsilon$-good arms. This results in an accelerated rate for the
$(\epsilon,\delta)$-PAC criterion, which closes the gap between the upper and
lower bounds in prior art. For the more challenging data-poor regime, we
propose Bracketing SH (BSH) that enjoys the same improvement even without
sampling each arm at least once. Our empirical study shows that BSH outperforms
existing methods on real-world tasks.
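At the heart of the paper is a sharpened analysis of the Sequential Halving (SH) algorithm, which splits the budget $T$ evenly across $\lceil \log_2 n \rceil$ elimination rounds and discards the worse half of the surviving arms each round. The following is a minimal sketch of SH, not the paper's exact pseudocode; the Bernoulli instance, function names, and budget split are illustrative assumptions.

```python
import math
import random

def sequential_halving(pull, n_arms, budget):
    """Minimal Sequential Halving sketch.

    pull(i) -> one stochastic reward for arm i.
    The budget is split evenly over ceil(log2(n_arms)) rounds;
    each round samples every surviving arm equally and keeps
    the better half by empirical mean.
    """
    arms = list(range(n_arms))
    rounds = max(1, math.ceil(math.log2(n_arms)))
    for _ in range(rounds):
        if len(arms) == 1:
            break
        pulls_per_arm = max(1, budget // (rounds * len(arms)))
        means = {a: sum(pull(a) for _ in range(pulls_per_arm)) / pulls_per_arm
                 for a in arms}
        arms.sort(key=means.get, reverse=True)
        arms = arms[:math.ceil(len(arms) / 2)]
    return arms[0]

# Hypothetical Bernoulli instance: arm 3 has the highest mean (0.9).
random.seed(0)
mu = [0.5, 0.6, 0.4, 0.9, 0.55]
chosen = sequential_halving(lambda i: float(random.random() < mu[i]),
                            n_arms=len(mu), budget=2000)
```

Note that $\epsilon$ is never an input to the procedure, which is exactly the property the paper's analysis exploits: the same run is simultaneously bounded against every choice of $\epsilon > 0$.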
Related papers
- Almost Minimax Optimal Best Arm Identification in Piecewise Stationary Linear Bandits [55.957560311008926]
We propose a piecewise stationary linear bandit (PSLB) model where the quality of an arm is measured by its return averaged over all contexts.
PS$\varepsilon$BAI$^+$ is guaranteed to identify an $\varepsilon$-optimal arm with probability $\ge 1-\delta$ and with a minimal number of samples.
arXiv Detail & Related papers (2024-10-10T06:15:42Z) - On the complexity of All $\varepsilon$-Best Arms Identification [2.1485350418225244]
We consider the problem of identifying all the $\varepsilon$-optimal arms in a finite multi-armed bandit with Gaussian rewards.
We propose a Track-and-Stop algorithm that identifies the set of $\varepsilon$-good arms w.h.p. and enjoys optimality (as $\delta$ goes to zero) in terms of the expected sample complexity.
arXiv Detail & Related papers (2022-02-13T10:54:52Z) - Best-Arm Identification in Correlated Multi-Armed Bandits [9.180690233707645]
We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms.
We show that the total samples obtained by C-LUCB are of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$ as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{A}} \log\left(\frac{1}{\delta}\right)\right)$ samples required in the independent reward setting, where the competitive set $\mathcal{C}$ is a subset of the full arm set $\mathcal{A}$.
arXiv Detail & Related papers (2021-09-10T15:41:07Z) - Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z) - Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log (|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z) - An Optimal Elimination Algorithm for Learning a Best Arm [37.18327505953403]
We consider the classic problem of $(\epsilon,\delta)$-PAC learning a best arm and propose a new approach to it.
arXiv Detail & Related papers (2020-06-20T20:21:33Z) - Explicit Best Arm Identification in Linear Bandits Using No-Regret
Learners [17.224805430291177]
We study the problem of best arm identification in linearly parameterised multi-armed bandits.
We propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem.
arXiv Detail & Related papers (2020-06-13T05:00:01Z)