Related papers: Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback

Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback

URL: http://arxiv.org/abs/2505.24193v1
Date: Fri, 30 May 2025 04:05:52 GMT
Title: Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback
Authors: Ofir Schlisselberg, Tal Lancewicki, Peter Auer, Yishay Mansour,
Abstract summary: We study the multi-armed bandit problem with adversarially chosen delays in the Best-Both-Worlds (BoBW) framework.<n>Our main contribution is a new algorithm that matches the known lower bounds in each setting individually.<n>We provide a regret bound which scale as $sum_i>0left(log T/Delta_iright) + frac1Ksum Delta_i sigma_max$, where $Delta_i$ is the sub-optimality gap of arm $i$ and $
Score: 40.86836752251116
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic settings. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound which scale as $\sum_{i:\Delta_i>0}\left(\log T/\Delta_i\right) + \frac{1}{K}\sum \Delta_i \sigma_{max}$, where $\Delta_i$ is the sub-optimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both stochastic and adversarial regimes in delayed environment. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.

Related papers

LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
We develop a emphBest-of-Both-Worlds (BoBW) algorithm with regret upper bounds in both adversarial regimes.<n>We show that the proposed algorithm achieves $Oleft(log(T)1+beta2+betaTfrac12+betaright)$ regret under the margin condition.
arXiv Detail & Related papers (2024-03-05T18:59:47Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel. An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring [41.20252349191698]
Partial monitoring is a generic framework of online decision-making problems with limited feedback.<n>We present a new framework and analysis for ExO with a hybrid regularizer.<n>In particular, we derive a regret bound of $O(sum_a neq a*)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds.
arXiv Detail & Related papers (2024-02-13T09:34:22Z)
Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z)
Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novelgreedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $tin [T]$ is observed. SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
Non-stationary Online Convex Optimization with Arbitrary Delays [50.46856739179311]
This paper investigates the delayed online convex optimization (OCO) in non-stationary environments. We first propose a simple algorithm, namely DOGD, which performs a gradient descent step for each delayed gradient according to their arrival order. We develop an improved algorithm, which reduces those dynamic regret bounds achieved by DOGD to $O(sqrtbardT(P_T+1))$.
arXiv Detail & Related papers (2023-05-20T07:54:07Z)
A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback [25.68113242132723]
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback. We simultaneously achieves a near-optimal adversarial regret guarantee in the setting with fixed delays. We also present an extension of the algorithm to the case of arbitrary delays.
arXiv Detail & Related papers (2022-06-29T20:49:45Z)
Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds [34.37963000493442]
This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both adversarial settings.
arXiv Detail & Related papers (2022-06-14T12:58:46Z)
Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits [22.18577284185939]
We develop robust best-of-both-worlds algorithms for heavy-tailed multi-armed bandits. textttHTINF simultaneously achieves the optimal regret for both and adversarial environments. textttAdaTINF is the first algorithm that can adapt to both $alpha$ and $sigma$ to achieve optimal gap-indepedent regret bound.
arXiv Detail & Related papers (2022-01-28T03:53:39Z)
Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms. We show that it achieves the cumulative regret upper bound $Oast(sqrtTgamma_T)$ using $O(loglog T)$ batches within time horizon $T$. In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z)
Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$2$UCB algorithm has the optimal regret bound $tildeO(dsqrtkT + dk)$ for the partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.