Bandits with many optimal arms
- URL: http://arxiv.org/abs/2103.12452v1
- Date: Tue, 23 Mar 2021 11:02:31 GMT
- Title: Bandits with many optimal arms
- Authors: Rianne de Heide, James Cheshire, Pierre Ménard, and Alexandra Carpentier
- Abstract summary: We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
- Score: 68.17472536610859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a stochastic bandit problem with a possibly infinite number of
arms. We write $p^*$ for the proportion of optimal arms and $\Delta$ for the
minimal mean-gap between optimal and sub-optimal arms. We characterize the
optimal learning rates both in the cumulative regret setting, and in the
best-arm identification setting in terms of the problem parameters $T$ (the
budget), $p^*$ and $\Delta$. For the objective of minimizing the cumulative
regret, we provide a lower bound of order $\Omega(\log(T)/(p^*\Delta))$ and a
UCB-style algorithm with matching upper bound up to a factor of
$\log(1/\Delta)$. Our algorithm needs $p^*$ to calibrate its parameters, and we
prove that this knowledge is necessary, since adapting to $p^*$ in this setting
is impossible. For best-arm identification we also provide a lower bound of
order $\Omega(\exp(-cT\Delta^2p^*))$ on the probability of outputting a
sub-optimal arm where $c>0$ is an absolute constant. We also provide an
elimination algorithm with an upper bound matching the lower bound up to a
factor of order $\log(1/\Delta)$ in the exponential, and that does not need
$p^*$ or $\Delta$ as parameters.
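The two algorithmic ideas above lend themselves to short illustrations. For the cumulative-regret result, one natural reading of "needs $p^*$ to calibrate its parameters" is that the learner subsamples enough arms for at least one to be optimal with high probability: since each freshly drawn arm is optimal with probability $p^*$, a subsample of size $m = \lceil\log(T)/p^*\rceil$ misses every optimal arm with probability at most $(1-p^*)^m \le 1/T$. The Python sketch below runs plain UCB1 on such a subsample; the subsample size, all function names, and the $[0,1]$-reward assumption are illustrative choices, not the paper's exact procedure.

```python
import math
import random

def subsample_ucb(draw_arm, pull, T, p_star):
    """UCB1 on a uniform subsample of arms (sketch, not the paper's algorithm).

    If each fresh arm is optimal with probability p_star, a subsample of
    size ceil(log(T)/p_star) contains an optimal arm w.p. >= 1 - 1/T.
    Rewards are assumed to lie in [0, 1].
    """
    m = max(1, math.ceil(math.log(T) / p_star))   # calibration uses p_star
    arms = [draw_arm() for _ in range(m)]         # fresh uniform subsample
    counts, sums = [0] * m, [0.0] * m
    for t in range(1, T + 1):
        if t <= m:                                # initialize: pull each arm once
            i = t - 1
        else:                                     # standard UCB1 index
            i = max(range(m), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = pull(arms[i])
        counts[i] += 1
        sums[i] += r
    return arms, counts, sums
```

A toy instance with Bernoulli arms, where a fresh arm is optimal (mean $0.9$) with probability $p^*$ and has mean $0.9-\Delta$ otherwise (all numbers hypothetical):

```python
p_star, Delta, T = 0.1, 0.2, 10_000
draw_arm = lambda: 0.9 if random.random() < p_star else 0.9 - Delta
pull = lambda mu: 1.0 if random.random() < mu else 0.0
arms, counts, sums = subsample_ucb(draw_arm, pull, T, p_star)
```

For best-arm identification, the abstract's elimination algorithm needs neither $p^*$ nor $\Delta$. The sketch below shows only a generic fixed-budget sequential-halving step (repeatedly discard the empirically worse half of the candidates); it does not attempt the paper's machinery for handling infinitely many arms without knowledge of $p^*$ or $\Delta$.

```python
import math

def sequential_halving(candidates, pull, T):
    """Generic fixed-budget elimination (sketch): halve the candidate set
    each phase, spending roughly T / ceil(log2(K)) pulls in each phase."""
    candidates = list(candidates)
    phases = max(1, math.ceil(math.log2(len(candidates))))
    for _ in range(phases):
        if len(candidates) == 1:
            break
        n = max(1, T // (phases * len(candidates)))  # pulls per arm this phase
        means = [sum(pull(a) for _ in range(n)) / n for a in candidates]
        ranked = sorted(range(len(candidates)),
                        key=means.__getitem__, reverse=True)
        candidates = [candidates[i] for i in ranked[:max(1, len(candidates) // 2)]]
    return candidates[0]  # recommended arm
```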
Related papers
- A/B Testing and Best-arm Identification for Linear Bandits with
Robustness to Non-stationarity [28.068960555415014]
We investigate the fixed-budget best-arm identification problem for linear bandits in a potentially non-stationary environment.
An algorithm will aim to correctly identify the best arm $x^* := \arg\max_{x\in\mathcal{X}} x^\top \sum_{t=1}^{T}\theta_t$ with probability as high as possible.
arXiv Detail & Related papers (2023-07-27T19:03:36Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, stated as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - On the complexity of All $\varepsilon$-Best Arms Identification [2.1485350418225244]
We consider the problem of identifying all the $\varepsilon$-optimal arms in a finite multi-armed bandit with Gaussian rewards.
We propose a Track-and-Stop algorithm that identifies the set of $\varepsilon$-good arms with high probability and enjoys asymptotic optimality (as $\delta$ goes to zero) in terms of the expected sample complexity.
arXiv Detail & Related papers (2022-02-13T10:54:52Z) - Differentially Private Multi-Armed Bandits in the Shuffle Model [58.22098764071924]
We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model.
Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
arXiv Detail & Related papers (2021-06-05T14:11:01Z) - Online Convex Optimization with Continuous Switching Constraint [78.25064451417082]
We introduce the problem of online convex optimization with continuous switching constraint.
We show that, for strongly convex functions, the regret bound can be improved to $O(\log T)$ for $S=\Omega(\log T)$, and to $O(\min\{T/\exp(S)+S,\,T\})$ for $S=O(\log T)$.
arXiv Detail & Related papers (2021-03-21T11:43:35Z) - Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment [7.4288915613206505]
We study differentially private online learning problems in a stochastic environment under both bandit and full information feedback.
For differentially private bandits, we propose both UCB and Thompson Sampling-based algorithms that are anytime and achieve the optimal $O\left(\sum_{j:\,\Delta_j>0} \frac{\ln(T)}{\min\{\Delta_j,\,\epsilon\}}\right)$ regret, matching the lower bound.
For the differentially private full information setting, we also present an $\epsilon$-differentially private algorithm.
arXiv Detail & Related papers (2021-02-16T02:48:16Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where the cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Optimal $\delta$-Correct Best-Arm Selection for Heavy-Tailed
Distributions [2.2940141855172036]
We consider the problem of identifying, among a collection of heavy-tailed distributions, the one with the maximum mean using a $\delta$-correct algorithm.
Lower bounds for $\delta$-correct algorithms are well known.
We propose a $\delta$-correct algorithm that matches the lower bound as $\delta$ reduces to zero.
arXiv Detail & Related papers (2019-08-24T05:31:49Z)