A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays
- URL: http://arxiv.org/abs/2308.10675v2
- Date: Mon, 27 May 2024 19:30:57 GMT
- Title: A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays
- Authors: Saeed Masoudian, Julian Zimmert, Yevgeny Seldin,
- Abstract summary: We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback.
Our algorithm can tolerate arbitrary excessive delays up to order $T$.
- Score: 24.998122268199797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback. In contrast to prior work, which required prior knowledge of the maximal delay $d_{\mathrm{max}}$ and had a linear dependence of the regret on it, our algorithm can tolerate arbitrary excessive delays up to order $T$ (where $T$ is the time horizon). The algorithm is based on three technical innovations, which may all be of independent interest: (1) We introduce the first implicit exploration scheme that works in best-of-both-worlds setting. (2) We introduce the first control of distribution drift that does not rely on boundedness of delays. The control is based on the implicit exploration scheme and adaptive skipping of observations with excessive delays. (3) We introduce a procedure relating standard regret with drifted regret that does not rely on boundedness of delays. At the conceptual level, we demonstrate that complexity of best-of-both-worlds bandits with delayed feedback is characterized by the amount of information missing at the time of decision making (measured by the number of outstanding observations) rather than the time that the information is missing (measured by the delays).
Related papers
- Biased Dueling Bandits with Stochastic Delayed Feedback [6.167074802065416]
We present two algorithms designed to handle situations involving delay.
Our first algorithm, requiring complete delay distribution information, achieves the optimal regret bound for the dueling bandit problem when there is no delay.
The second algorithm is tailored for situations where the distribution is unknown, but only the expected value of delay is available.
arXiv Detail & Related papers (2024-08-26T19:49:12Z) - Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z) - Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O(sqrtnT3/4+sqrtdT)$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)2/3log/3T+dlog T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z) - Stochastic Submodular Bandits with Delayed Composite Anonymous Bandit
Feedback [39.12903814606534]
This paper investigates the problem of multiarmed bandits with submodular (in expectation) rewards and full-bandit delayed feedback.
The delayed feedback is composed of components of rewards from past actions, with unknown division among the sub-components.
The considered algorithm is demonstrated to outperform other full-bandit approaches with delayed composite anonymous feedback.
arXiv Detail & Related papers (2023-03-23T18:38:33Z) - Borda Regret Minimization for Generalized Linear Dueling Bandits [65.09919504862496]
We study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score.
We propose a rich class of generalized linear dueling bandit models, which cover many existing models.
Our algorithm achieves an $tildeO(d2/3 T2/3)$ regret, which is also optimal.
arXiv Detail & Related papers (2023-03-15T17:59:27Z) - Delayed Feedback in Generalised Linear Bandits Revisited [5.349852254138085]
We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner.
We show that a natural adaptation of an optimistic algorithm to the delayed feedback achieves a regret bound where the penalty for the delays is independent of the horizon.
arXiv Detail & Related papers (2022-07-21T23:35:01Z) - Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When
Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB)
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z) - Nonstochastic Bandits and Experts with Arm-Dependent Delays [17.272515865592542]
We study nonstochastic bandits and experts in a delayed setting where delays depend on both time and arms.
Our analyses hinge on a novel bound on the drift, measuring how much better an algorithm can perform when given a look-ahead of one round.
arXiv Detail & Related papers (2021-11-02T13:36:11Z) - Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions [54.25616645675032]
We study the Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm.
We consider two settings: the reward-dependent delay setting, where realized delays may depend on the rewards, and the reward-independent delay setting.
Our main contribution is algorithms that achieve near-optimal regret in each of the settings.
arXiv Detail & Related papers (2021-06-04T12:26:06Z) - Adapting to Delays and Data in Adversarial Multi-Armed Bandits [7.310043452300736]
We analyze variants of the Exp3 algorithm that tune their step-size using only information available at the time of the decisions.
We obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses.
arXiv Detail & Related papers (2020-10-12T20:53:52Z) - Stochastic bandits with arm-dependent delays [102.63128271054741]
We propose a simple but efficient UCB-based algorithm called the PatientBandits.
We provide both problems-dependent and problems-independent bounds on the regret as well as performance lower bounds.
arXiv Detail & Related papers (2020-06-18T12:13:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.