A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial
Semi-Bandits, Linear Bandits, and MDPs
- URL: http://arxiv.org/abs/2305.08629v1
- Date: Mon, 15 May 2023 13:21:50 GMT
- Title: A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial
Semi-Bandits, Linear Bandits, and MDPs
- Authors: Dirk van der Hoeven and Lukas Zierahn and Tal Lancewicki and Aviv
Rosenberg and Nicolò Cesa-Bianchi
- Abstract summary: We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback.
Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.
- Score: 18.199326045904996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We derive a new analysis of Follow The Regularized Leader (FTRL) for online
learning with delayed bandit feedback. By separating the cost of delayed
feedback from that of bandit feedback, our analysis allows us to obtain new
results in three important settings. On the one hand, we derive the first
optimal (up to logarithmic factors) regret bounds for combinatorial
semi-bandits with delay and adversarial Markov decision processes with delay
(and known transition functions). On the other hand, we use our analysis to
derive an efficient algorithm for linear bandits with delay achieving
near-optimal regret bounds. Our novel regret decomposition shows that FTRL
remains stable across multiple rounds under mild assumptions on the Hessian of
the regularizer.
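To make the setting concrete, here is a minimal Python sketch of FTRL with delayed bandit feedback in the simplest case: a K-armed bandit with a negative-entropy regularizer, i.e., exponential weights over the loss estimates observed so far. It is an illustration of the general recipe under assumed names and constants, not the paper's exact algorithm, which covers combinatorial, linear, and MDP settings.
```python
import numpy as np

rng = np.random.default_rng(0)
K, T, eta = 5, 10_000, 0.05            # arms, horizon, learning rate (all illustrative)
losses = rng.uniform(size=(T, K))      # stand-in for adversarial losses in [0, 1]
delays = rng.integers(0, 50, size=T)   # d_t: feedback from round t arrives at t + d_t

L_hat = np.zeros(K)   # cumulative importance-weighted estimates observed so far
pending = {}          # arrival round -> list of (arm, loss estimate)

for t in range(T):
    # FTRL with a negative-entropy regularizer over the *observed*
    # estimates reduces to exponential weights.
    w = np.exp(-eta * (L_hat - L_hat.min()))
    p = w / w.sum()
    arm = rng.choice(K, p=p)

    # Importance-weighted estimate built at play time, revealed after the delay.
    est = losses[t, arm] / p[arm]
    pending.setdefault(t + delays[t], []).append((arm, est))

    # Fold in whatever delayed feedback arrives this round.
    for a, e in pending.pop(t, []):
        L_hat[a] += e
```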
Related papers
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation [62.969796245827006]
Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling.
We show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3T} + d^2H^2\,\mathbb{E}[\tau])$ worst-case regret in the presence of unknown delays.
We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI.
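As a concrete picture of gradient-based approximate posterior sampling via Langevin dynamics, here is a hedged sketch of an unadjusted Langevin sampler targeting a ridge-regression (Gaussian) posterior; the function name, target, and step sizes are illustrative assumptions, not the paper's Delayed-LPSVI.
```python
import numpy as np

def langevin_sample(Phi, y, lam=1.0, step=1e-3, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics targeting the posterior with negative
    log-density U(theta) = 0.5*||Phi @ theta - y||^2 + 0.5*lam*||theta||^2."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ theta - y) + lam * theta       # grad U(theta)
        noise = np.sqrt(2.0 * step) * rng.standard_normal(theta.shape)
        theta = theta - step * grad + noise                  # Langevin step
    return theta

# Illustrative usage on synthetic data:
rng0 = np.random.default_rng(0)
Phi = rng0.standard_normal((50, 4))
y = Phi @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng0.standard_normal(50)
print(langevin_sample(Phi, y))
```
The appeal of this style of sampler is that it never forms the exact posterior covariance, only gradients, which is what makes approximate posterior sampling tractable inside value iteration at scale.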
arXiv Detail & Related papers (2023-10-29T06:12:43Z)
- A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays [24.998122268199797]
We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback.
Our algorithm can tolerate arbitrary excessive delays up to order $T$.
arXiv Detail & Related papers (2023-08-21T12:17:40Z)
- Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related Rewards [7.0997346625024]
We formalize a non-stationary and delayed semi-bandit problem with causally related rewards.
We develop a policy that learns the structural dependencies from delayed feedback and utilizes them to optimize decision-making.
We evaluate our method via numerical analysis using synthetic and real-world datasets to detect the regions that contribute the most to the spread of COVID-19 in Italy.
arXiv Detail & Related papers (2023-07-18T09:22:33Z)
- Stochastic Submodular Bandits with Delayed Composite Anonymous Bandit Feedback [39.12903814606534]
This paper investigates the problem of multi-armed bandits with submodular (in expectation) rewards and full-bandit delayed feedback.
The delayed feedback is composed of components of rewards from past actions, with an unknown division among the sub-components.
The proposed algorithm is shown to outperform other full-bandit approaches under delayed composite anonymous feedback.
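To make the composite anonymous feedback model concrete, here is a hedged toy sketch (illustrative splitting distribution and horizon, not the paper's construction): each action's reward is split into pieces that land on future rounds, and the learner only ever observes per-round sums.
```python
import numpy as np

rng = np.random.default_rng(1)
T, d_max = 100, 5   # horizon and maximum spread of a reward (illustrative)

def spread(reward, rng):
    """Split a reward into d_max nonnegative pieces; the split is never revealed."""
    return reward * rng.dirichlet(np.ones(d_max))

observed = np.zeros(T + d_max)   # observed[s] = aggregate feedback seen at round s
for t in range(T):
    r = rng.uniform()                          # reward of the action played at round t
    observed[t:t + d_max] += spread(r, rng)    # pieces land on rounds t .. t+d_max-1

# Each round the learner sees only observed[t]: a sum of anonymous pieces of
# past rewards, with no attribution to actions or rounds.
```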
arXiv Detail & Related papers (2023-03-23T18:38:33Z)
- A Reduction-based Framework for Sequential Decision Making with Delayed Feedback [53.79893086002961]
We study delayed feedback in general multi-agent sequential decision making.
We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that can handle delayed feedback.
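A hedged toy rendering of the idea (illustrative bandit instance and policy update, not the paper's framework): an algorithm that commits to a policy per batch and updates only at batch boundaries absorbs delays naturally, since a batch can simply be extended until all of its feedback has arrived.
```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
T, K = 5000, 3
mu = rng.uniform(size=K)                  # true arm means, unknown to the learner
delays = rng.integers(0, 40, size=T)      # feedback from round t arrives at t + delays[t]

sums, cnts = np.zeros(K), np.zeros(K)
inbox = defaultdict(list)                 # arrival round -> [(arm, reward)]
outstanding = 0                           # feedback issued but not yet received
policy = np.ones(K) / K                   # batch policy: start with uniform exploration

for t in range(T):
    arm = int(rng.choice(K, p=policy))
    inbox[t + delays[t]].append((arm, rng.binomial(1, mu[arm])))
    outstanding += 1
    for a, r in inbox.pop(t, []):         # consume whatever delayed feedback arrives now
        sums[a] += r; cnts[a] += 1
        outstanding -= 1
    # Batch boundary: update the policy only once *all* issued feedback is in,
    # so delays merely stretch batches instead of corrupting updates.
    if outstanding == 0 and cnts.min() > 0:
        best = int(np.argmax(sums / cnts))
        policy = np.full(K, 0.1 / K)
        policy[best] += 0.9
```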
arXiv Detail & Related papers (2023-02-03T01:16:09Z)
- Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback [98.7678704343537]
We focus on a class of nonsubmodular functions with special structure, and prove regret guarantees for several variants of the online and approximate online bandit gradient descent algorithms.
We derive bounds on the agent's regret in both the full-information and bandit feedback settings, even if the delay between choosing a decision and receiving the incurred cost is unbounded.
arXiv Detail & Related papers (2022-05-15T08:27:12Z)
- Nonstochastic Bandits and Experts with Arm-Dependent Delays [17.272515865592542]
We study nonstochastic bandits and experts in a delayed setting where delays depend on both time and arms.
Our analyses hinge on a novel bound on the drift, measuring how much better an algorithm can perform when given a look-ahead of one round.
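One plausible formalization of that drift quantity (hedged: the paper's exact definition may differ) measures the expected advantage of a one-round look-ahead:
```latex
% p_t: distribution played at round t; \ell_t: loss vector of round t.
% Hypothetical formalization of the one-round look-ahead advantage:
\[
  \mathrm{drift}_t \;=\; \mathbb{E}\big[\langle p_t - p_{t+1},\, \ell_t \rangle\big]
\]
% i.e., how much round-t loss the learner would save if it could already
% play the distribution p_{t+1} computed with the feedback from round t.
```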
arXiv Detail & Related papers (2021-11-02T13:36:11Z)
- Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
arXiv Detail & Related papers (2021-05-25T10:08:41Z)
- Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.