A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback
- URL: http://arxiv.org/abs/2206.14906v1
- Date: Wed, 29 Jun 2022 20:49:45 GMT
- Title: A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback
- Authors: Saeed Masoudian, Julian Zimmert, Yevgeny Seldin
- Abstract summary: We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback.
It simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays.
We also present an extension of the algorithm to the case of arbitrary delays.
- Score: 25.68113242132723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a modified tuning of the algorithm of Zimmert and Seldin [2020]
for adversarial multiarmed bandits with delayed feedback, which in addition to
the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin
simultaneously achieves a near-optimal regret guarantee in the stochastic
setting with fixed delays. Specifically, the adversarial regret guarantee is
$\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$
is the number of arms, and $d$ is the fixed delay, whereas the stochastic
regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i}
\log(T) + \frac{d}{\Delta_{i}\log K}) + d K^{1/3}\log K\right)$, where
$\Delta_i$ are the suboptimality gaps. We also present an extension of the
algorithm to the case of arbitrary delays, which is based on an oracle
knowledge of the maximal delay $d_{max}$ and achieves $\mathcal{O}(\sqrt{TK} +
\sqrt{D\log K} + d_{max}K^{1/3} \log K)$ regret in the adversarial regime,
where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq
i^*}(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{max}}{\Delta_{i}\log K}) +
d_{max}K^{1/3}\log K\right)$ regret in the stochastic regime, where
$\sigma_{max}$ is the maximal number of outstanding observations. Finally, we
present a lower bound that matches the regret upper bound achieved by the skipping
technique of Zimmert and Seldin [2020] in the adversarial setting.
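The core mechanism is easy to illustrate in code. Below is a minimal sketch (Python/NumPy, under our own simplifying assumptions) of a Tsallis-INF-style learner run with a fixed delay $d$: the player samples from the FTRL distribution induced by the 1/2-Tsallis entropy over importance-weighted loss estimates, and each estimate only enters the cumulative totals $d$ rounds after the arm is played. The learning-rate schedule `eta` is a generic placeholder rather than the modified tuning contributed by the paper, and the skipping scheme for arbitrary delays is omitted.

```python
from collections import deque

import numpy as np


def tsallis_inf_distribution(cum_loss, eta, iters=60):
    """FTRL step with the 1/2-Tsallis entropy: find the normalizer x < min_i L_i
    such that p_i = 4 / (eta * (L_i - x))**2 sums to 1; the sum is increasing
    in x, so plain bisection works."""
    K = len(cum_loss)
    lo = cum_loss.min() - 2.0 * np.sqrt(K) / eta  # here sum(p) <= 1
    hi = cum_loss.min()                           # here sum(p) diverges
    for _ in range(iters):
        x = 0.5 * (lo + hi)
        if (4.0 / (eta * (cum_loss - x)) ** 2).sum() > 1.0:
            hi = x
        else:
            lo = x
    p = 4.0 / (eta * (cum_loss - lo)) ** 2
    return p / p.sum()


def run_delayed_bandit(losses, d, seed=0):
    """Play a (T, K) loss matrix with fixed feedback delay d: the
    importance-weighted estimate produced at round t is only folded into
    the cumulative loss estimates at round t + d."""
    T, K = losses.shape
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(K)  # importance-weighted cumulative loss estimates
    pending = deque()       # (arrival_round, arm, loss_estimate)
    total = 0.0
    for t in range(T):
        while pending and pending[0][0] <= t:  # delayed feedback arrives
            _, arm, est = pending.popleft()
            cum_loss[arm] += est
        eta = 1.0 / np.sqrt(t + 1)  # placeholder rate; the paper's tuning differs
        p = tsallis_inf_distribution(cum_loss, eta)
        a = rng.choice(K, p=p)
        loss = losses[t, a]
        total += loss
        pending.append((t + d, a, loss / p[a]))  # revealed d rounds later
    return total
```

Handling arbitrary delays additionally requires skipping observations with excessive delay based on the oracle knowledge of $d_{max}$, as described in the abstract above; that refinement is not reproduced in this sketch.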
Related papers
- Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We investigate bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed after a delay.
We develop a novel algorithm and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z)
- Minimax Optimal Submodular Optimization with Bandit Feedback [13.805872311596739]
We consider maximizing a monotonic, submodular set function $f: 2^{[n]} \rightarrow [0,1]$ under bandit feedback.
Specifically, $f$ is unknown to the learner, but at each time $t=1,\dots,T$ the learner chooses a set $S_t \subset [n]$ with $|S_t| \leq k$ and receives reward $f(S_t) + \eta_t$, where $\eta_t$ is mean-zero sub-Gaussian noise.
arXiv Detail & Related papers (2023-10-27T20:19:03Z)
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ over $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z)
- Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays [21.94728545221709]
We consider the Scale-Free Adversarial Multi Armed Bandit (MAB) problem with unrestricted feedback delays.
SFBanker achieves $\mathcal{O}(\sqrt{K(D+T)}L)\cdot \mathrm{polylog}(T, L)$ total regret, where $T$ is the total number of steps and $D$ is the total feedback delay.
arXiv Detail & Related papers (2021-10-26T04:06:51Z)
- Differentially Private Multi-Armed Bandits in the Shuffle Model [58.22098764071924]
We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model.
Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
arXiv Detail & Related papers (2021-06-05T14:11:01Z)
- Improved Analysis of Robustness of the Tsallis-INF Algorithm to Adversarial Corruptions in Stochastic Multiarmed Bandits [12.462608802359936]
We derive improved regret bounds for the Tsallis-INF algorithm of Zimmert and Seldin (2021).
In particular, for $C = \Theta\left(\frac{T}{\log T}\right)$, where $T$ is the time horizon, we achieve an improvement by a multiplicative factor.
We also improve the dependence of the regret bound on the time horizon from $\log T$ to $\log \frac{(K-1)T}{\left(\sum_{i\neq i^*}\frac{1}{\Delta_i}\right)^2}$.
arXiv Detail & Related papers (2021-03-23T12:26:39Z)
- Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z)
- An Algorithm for Stochastic and Adversarial Bandits with Switching Costs [10.549307055348596]
We propose an algorithm for adversarial multiarmed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played.
Our algorithm is based on an adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or the time horizon.
arXiv Detail & Related papers (2021-02-19T11:03:51Z)
- Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where the cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
- Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}\left(\min\left\{\sqrt{T}, \sqrt{\mathcal{E}}\,T^{\frac{1}{3}}\right\}\right)$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)