Delay as Payoff in MAB
- URL: http://arxiv.org/abs/2408.15158v2
- Date: Tue, 15 Oct 2024 05:03:28 GMT
- Title: Delay as Payoff in MAB
- Authors: Ofir Schlisselberg, Ido Cohen, Tal Lancewicki, Yishay Mansour
- Abstract summary: We investigate a variant of the classical Multi-armed Bandit (MAB) problem, where the payoff received by an agent is both delayed, and directly corresponds to the magnitude of the delay.
Our main contributions are tight upper and lower bounds for both the cost and reward settings.
Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than in the reward scenario.
- Score: 40.65658965074464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate a variant of the classical stochastic Multi-armed Bandit (MAB) problem, where the payoff received by an agent (either cost or reward) is both delayed, and directly corresponds to the magnitude of the delay. This setting faithfully models many real-world scenarios, such as the time it takes for a data packet to traverse a network given a choice of route (where delay serves as the agent's cost), or a user's time spent on a web page given a choice of content (where delay serves as the agent's reward). Our main contributions are tight upper and lower bounds for both the cost and reward settings. For the case that delays serve as costs, which we are the first to consider, we prove optimal regret that scales as $\sum_{i:\Delta_i > 0}\frac{\log T}{\Delta_i} + d^*$, where $T$ is the maximal number of steps, $\Delta_i$ are the sub-optimality gaps and $d^*$ is the minimal expected delay amongst arms. For the case that delays serve as rewards, we show optimal regret of $\sum_{i:\Delta_i > 0}\frac{\log T}{\Delta_i} + \bar{d}$, where $\bar{d}$ is the second maximal expected delay. These improve over the regret in the general delay-dependent payoff setting, which scales as $\sum_{i:\Delta_i > 0}\frac{\log T}{\Delta_i} + D$, where $D$ is the maximum possible delay. Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than in the reward scenario. Finally, we accompany our theoretical results with an empirical evaluation.
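Below is a minimal Python sketch of the cost setting described in the abstract, not the paper's actual protocol or algorithm: each pull's delay is drawn from the chosen arm's distribution, that delay itself is the cost, and the learner only observes it once the delay has elapsed. The exponential delay distributions, the arm means, and the UCB-style lower-confidence index are illustrative assumptions introduced here for the example.

```python
import numpy as np

# Sketch of "delay as payoff" in the cost setting (assumptions, not the paper's
# algorithm): pulling an arm yields no immediate observation; the pull completes
# after a random delay, and that delay itself is the cost the learner pays.

rng = np.random.default_rng(0)
T = 10_000
mean_delays = np.array([2.0, 3.0, 5.0])   # expected delay (= expected cost) per arm, illustrative
K = len(mean_delays)

counts = np.zeros(K)          # completed pulls per arm
cost_sums = np.zeros(K)       # observed delays (costs) per arm
pending = []                  # (arrival_time, arm, delay) for in-flight pulls
total_cost = 0.0              # simulator-side accounting of incurred cost

for t in range(T):
    # Collect feedback whose delay has elapsed.
    still_pending = []
    for (arrival, arm, delay) in pending:
        if arrival <= t:
            counts[arm] += 1
            cost_sums[arm] += delay
        else:
            still_pending.append((arrival, arm, delay))
    pending = still_pending

    # Optimistic (lower-confidence) index on observed delays; lower is better for costs.
    if np.any(counts == 0):
        arm = int(np.argmin(counts))          # pull each arm once first
    else:
        means = cost_sums / counts
        bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
        arm = int(np.argmin(means - bonus))

    delay = rng.exponential(mean_delays[arm])
    pending.append((t + delay, arm, delay))
    total_cost += delay

print(f"average cost per step: {total_cost / T:.3f} (best arm mean: {mean_delays.min():.3f})")
```

Under this protocol the learner's average cost should approach the smallest mean delay; the sketch only illustrates the feedback structure, while the paper's algorithm and its $\sum_{i:\Delta_i > 0}\frac{\log T}{\Delta_i} + d^*$ regret analysis differ.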
Related papers
- Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed, and only after a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z) - A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback [25.68113242132723]
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback.
The algorithm simultaneously achieves near-optimal adversarial and stochastic regret guarantees in the setting with fixed delays.
We also present an extension of the algorithm to the case of arbitrary delays.
arXiv Detail & Related papers (2022-06-29T20:49:45Z) - Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback [67.63049551992816]
We study online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback.
We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^{K} d^k$ is the total delay.
arXiv Detail & Related papers (2022-01-31T12:34:26Z) - Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for an arbitrary delay $d_t$.
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
arXiv Detail & Related papers (2021-06-22T15:50:45Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Online Strongly Convex Optimization with Unknown Delays [30.931538196386672]
We investigate the problem of online convex optimization with unknown delays.
We first extend the delayed variant of OGD for strongly convex functions.
We establish a better regret bound of $O(d\log T)$, where $d$ is the maximum delay.
arXiv Detail & Related papers (2021-03-21T10:16:15Z) - Learning Adversarial Markov Decision Processes with Delayed Feedback [45.86354980347581]
We consider online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback.
We present novel algorithms that achieve near-optimal high-probability regret of $\widetilde{O}(\sqrt{K} + \sqrt{D})$ under full-information feedback.
arXiv Detail & Related papers (2020-12-29T16:47:42Z) - Adapting to Delays and Data in Adversarial Multi-Armed Bandits [7.310043452300736]
We analyze variants of the Exp3 algorithm that tune their step-size using only information available at the time of the decisions.
We obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses.
arXiv Detail & Related papers (2020-10-12T20:53:52Z)