Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs
- URL: http://arxiv.org/abs/2503.19856v2
- Date: Thu, 26 Jun 2025 16:47:52 GMT
- Title: Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs
- Authors: Alexander Ryabchenko, Idan Attias, Daniel M. Roy,
- Abstract summary: We study online learning with oblivious delays under a novel clairvoyance'' that limits how many past rounds can be tracked simultaneously for delayed feedback.<n>Our algorithms achieve minimax regrets across all capacity levels, with performance degrading gracefully under suboptimal capacity.
- Score: 60.7808741738461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study online learning with oblivious losses and delays under a novel ``capacity constraint'' that limits how many past rounds can be tracked simultaneously for delayed feedback. Under ``clairvoyance'' (i.e., delay durations are revealed upfront each round) and/or ``preemptibility'' (i.e., we can stop tracking previously chosen round feedback), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the ``optimal capacity'' needed to match the minimax rates of classical delayed online learning, which implicitly assume unlimited capacity. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For $K$ actions and total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C + D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. When replacing clairvoyance with preemptibility, we require a known maximum delay bound $d_{\max}$, adding ${\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$ (i.e., $D=Td$), the minimax regret is $\Theta(\sqrt{TK(1+d/C)+Td\log(K)})$ and the optimal capacity is $\Theta(\min\{K/\log(K),d\})$ in the bandit setting, while in the full-information feedback setting, the minimax regret is $\Theta(\sqrt{T(d+1)\log(K)})$ and the optimal capacity is $\Theta(1)$. For round-dependent and fixed delays, our upper bounds are achieved using novel preemptive and non-preemptive scheduling policies, based on Pareto-distributed proxy delays, and batching techniques, respectively. Crucially, our work unifies delayed bandits, label-efficient learning, and online scheduling frameworks, demonstrating that robust online learning under delayed feedback is possible with surprisingly modest tracking capacity.
Related papers
- Exploiting Curvature in Online Convex Optimization with Delayed Feedback [6.390468088226495]
We study the online convex optimization problem with curved losses and delayed feedback.<n>We propose a variant of follow-the-regularized-leader that obtains regret of order $minsigma_maxln T, sqrtd_mathrmtot$.<n>We then consider exp-concave losses and extend the Online Newton Step algorithm to handle delays with an adaptive learning rate tuning.
arXiv Detail & Related papers (2025-06-09T09:49:54Z) - Delay as Payoff in MAB [40.65658965074464]
We investigate a variant of the classical Multi-armed Bandit (MAB) problem, where the payoff received by an agent is both delayed, and directly corresponds to the magnitude of the delay.
Our main contributions are tight upper and lower bounds for both the cost and reward settings.
Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward.
arXiv Detail & Related papers (2024-08-27T15:52:52Z) - Learning to Cover: Online Learning and Optimization with Irreversible Decisions [50.5775508521174]
We define an online learning and optimization problem with discrete and irreversible decisions contributing toward a coverage target.<n>In each period, a decision-maker selects facilities to open, receives information on the success of each one, and updates a classification model to guide future decisions.<n>The goal is to minimize facility openings under a chance constraint reflecting the coverage target, in an horizon regime characterized by a large target number of facilities.
arXiv Detail & Related papers (2024-06-20T23:00:25Z) - Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O(sqrtnT3/4+sqrtdT)$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)2/3log/3T+dlog T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computational efficient algorithm to achieve near-optimal regret of $tildeO(sqrtSAH3Kln (1/delta))$tildeO(cdot) hides logarithmic terms of $(S,A,H,K)$ in $K$ episodes.
Our technical contribution are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback [25.68113242132723]
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback.
We simultaneously achieves a near-optimal adversarial regret guarantee in the setting with fixed delays.
We also present an extension of the algorithm to the case of arbitrary delays.
arXiv Detail & Related papers (2022-06-29T20:49:45Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the emphcomplete policy regret
We provide an algorithm that w.h.p a complete policy regret guarantee of $tildemathcalO(mKsqrtT)$, where the $tildemathcalO$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays [21.94728545221709]
We consider the Scale-Free Adversarial Multi Armed Bandit (MAB) problem with unrestricted feedback delays.
textttSFBanker achieves $mathcal O(sqrtK(D+T)L)cdot rm polylog(T, L)$ total regret, where $T$ is the total number of steps and $D$ is the total feedback delay.
arXiv Detail & Related papers (2021-10-26T04:06:51Z) - Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits [45.43968161616453]
We study the optimal batch-regret tradeoff for batch linear contextual bandits.
We prove its regret guarantee, which features a two-phase expression as the time horizon $T$ grows.
We also prove a new matrix inequality concentration with dependence on their dynamic upper bounds.
arXiv Detail & Related papers (2021-10-15T12:32:33Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $widetilde O(B_star sqrt|S| |A| K)$ where $B_star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in-time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Online Convex Optimization with Continuous Switching Constraint [78.25064451417082]
We introduce the problem of online convex optimization with continuous switching constraint.
We show that, for strongly convex functions, the regret bound can be improved to $O(log T)$ for $S=Omega(log T)$, and $O(minT/exp(S)+S,T)$ for $S=O(log T)$.
arXiv Detail & Related papers (2021-03-21T11:43:35Z) - Adapting to Delays and Data in Adversarial Multi-Armed Bandits [7.310043452300736]
We analyze variants of the Exp3 algorithm that tune their step-size using only information available at the time of the decisions.
We obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses.
arXiv Detail & Related papers (2020-10-12T20:53:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.