Adapting to Delays and Data in Adversarial Multi-Armed Bandits
- URL: http://arxiv.org/abs/2010.06022v1
- Date: Mon, 12 Oct 2020 20:53:52 GMT
- Title: Adapting to Delays and Data in Adversarial Multi-Armed Bandits
- Authors: András György, Pooria Joulani
- Abstract summary: We analyze variants of the Exp3 algorithm that tune their step-size using only information available at the time of the decisions.
We obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses.
- Score: 7.310043452300736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the adversarial multi-armed bandit problem under delayed
feedback. We analyze variants of the Exp3 algorithm that tune their step-size
using only information (about the losses and delays) available at the time of
the decisions, and obtain regret guarantees that adapt to the observed (rather
than the worst-case) sequences of delays and/or losses. First, through a
remarkably simple proof technique, we show that with proper tuning of the step
size, the algorithm achieves an optimal (up to logarithmic factors) regret of
order $\sqrt{\log(K)(TK + D)}$ both in expectation and in high probability,
where $K$ is the number of arms, $T$ is the time horizon, and $D$ is the
cumulative delay. The high-probability version of the bound, which is the first
high-probability delay-adaptive bound in the literature, crucially depends on
the use of implicit exploration in estimating the losses. Then, following
Zimmert and Seldin [2019], we extend these results so that the algorithm can
"skip" rounds with large delays, resulting in regret bounds of order
$\sqrt{TK\log(K)} + |R| + \sqrt{D_{\bar{R}}\log(K)}$, where $R$ is an arbitrary
set of rounds (which are skipped) and $D_{\bar{R}}$ is the cumulative delay of
the feedback for other rounds. Finally, we present another, data-adaptive
(AdaGrad-style) version of the algorithm for which the regret adapts to the
observed (delayed) losses instead of only adapting to the cumulative delay
(this algorithm requires an a priori upper bound on the maximum delay, or the
advance knowledge of the delay for each decision when it is made). The
resulting bound can be orders of magnitude smaller on benign problems, and it
can be shown that the delay only affects the regret through the loss of the
best arm.
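The delay-adaptive Exp3 variant described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the concrete step-size schedule, the `loss_fn`/`delays` interface, and the `gamma` value are illustrative assumptions; only the overall shape (exponential weights, step size tuned from quantities observable at decision time, and implicit exploration in the loss estimates) follows the abstract.

```python
import math
import random

def exp3_delayed(loss_fn, K, T, delays, gamma=0.01):
    """Sketch of Exp3 under delayed feedback with implicit exploration.

    `delays[t]` is the (hypothetical) delay of round t's feedback; losses
    are assumed to lie in [0, 1]. The step size uses only information
    available at decision time, loosely mirroring the
    sqrt(log(K)/(tK + D_t)) tuning behind the sqrt(log(K)(TK + D)) bound.
    """
    L_hat = [0.0] * K   # cumulative importance-weighted loss estimates
    pending = {}        # arrival time -> list of (round, arm, prob, loss)
    D_seen = 0          # cumulative delay of feedback received so far
    chosen = []
    for t in range(T):
        eta = math.sqrt(math.log(K) / (K * (t + 1) + D_seen + 1))
        w = [math.exp(-eta * L) for L in L_hat]
        Z = sum(w)
        p = [wi / Z for wi in w]
        # sample an arm from the exponential-weights distribution
        r, acc, arm = random.random(), 0.0, K - 1
        for i, pi in enumerate(p):
            acc += pi
            if r <= acc:
                arm = i
                break
        loss = loss_fn(t, arm)
        pending.setdefault(t + delays[t], []).append((t, arm, p[arm], loss))
        # process all feedback arriving at this round
        for (s, a, pa, l) in pending.pop(t, []):
            D_seen += t - s
            # implicit exploration: divide by p + gamma (a downward-biased
            # estimate, which is what enables the high-probability bound)
            L_hat[a] += l / (pa + gamma)
        chosen.append(arm)
    return chosen
```

The key point is that `eta` depends only on the feedback already received (`D_seen`), so the tuning adapts to the observed delays rather than a worst-case bound on them.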
Related papers
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z) - Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
This paper studies bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under a delay.
We develop a novel algorithm and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z) - A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays [24.998122268199797]
We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback.
Our algorithm can tolerate arbitrary excessive delays up to order $T$.
arXiv Detail & Related papers (2023-08-21T12:17:40Z) - Non-stationary Online Convex Optimization with Arbitrary Delays [50.46856739179311]
This paper investigates the delayed online convex optimization (OCO) in non-stationary environments.
We first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient according to their arrival order.
We develop an improved algorithm, which reduces those dynamic regret bounds achieved by DOGD to $O(\sqrt{\bar{d}T(P_T+1)})$.
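The DOGD mechanism described above (one gradient descent step per delayed gradient, applied in arrival order) can be sketched as follows. The `grad_fn`, `project`, and `delays` arguments are hypothetical stand-ins for illustration, not the paper's actual interface.

```python
def dogd(grad_fn, x0, T, delays, eta, project):
    """Sketch of delayed online gradient descent (DOGD-style).

    Play the current point x_t each round; when a delayed gradient
    arrives, take one descent step per gradient, in arrival order.
    `project` maps the iterate back onto the feasible set.
    """
    x = x0
    pending = {}   # arrival time -> list of gradients
    played = []
    for t in range(T):
        played.append(x)
        g = grad_fn(t, x)  # gradient queried at the decision point
        pending.setdefault(t + delays[t], []).append(g)
        # apply every gradient whose feedback arrives this round
        for grad in pending.pop(t, []):
            x = project([xi - eta * gi for xi, gi in zip(x, grad)])
    return played
```

Even though each applied gradient is stale, the arrival-order updates keep the iterate sequence stable, which is what the dynamic regret analysis exploits.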
arXiv Detail & Related papers (2023-05-20T07:54:07Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Nonstochastic Bandits and Experts with Arm-Dependent Delays [17.272515865592542]
We study nonstochastic bandits and experts in a delayed setting where delays depend on both time and arms.
Our analyses hinge on a novel bound on the drift, measuring how much better an algorithm can perform when given a look-ahead of one round.
arXiv Detail & Related papers (2021-11-02T13:36:11Z) - Procrastinated Tree Search: Black-box Optimization with Delayed, Noisy,
and Multi-fidelity Feedback [11.064341598231195]
In black-box optimization problems, we aim to maximize an unknown objective function, where the function is only accessible through feedbacks of an evaluation or simulation oracle.
We propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS)
We provide a generic proof technique to quantify regret of PCTS under delayed, noisy, and multi-fidelity feedbacks.
arXiv Detail & Related papers (2021-10-14T08:55:41Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimal solution under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions [54.25616645675032]
We study the Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm.
We consider two settings: the reward-dependent delay setting, where realized delays may depend on the rewards, and the reward-independent delay setting.
Our main contribution is algorithms that achieve near-optimal regret in each of the settings.
arXiv Detail & Related papers (2021-06-04T12:26:06Z) - Online Strongly Convex Optimization with Unknown Delays [30.931538196386672]
We investigate the problem of online convex optimization with unknown delays.
We first extend the delayed variant of OGD for strongly convex functions.
We establish a better regret bound of $O(d\log T)$, where $d$ is the maximum delay.
arXiv Detail & Related papers (2021-03-21T10:16:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.