Related papers: Learning-Augmented Algorithms for MTS with Bandit Access to Multiple Predictors

Learning-Augmented Algorithms for MTS with Bandit Access to Multiple Predictors

URL: http://arxiv.org/abs/2506.05479v1
Date: Thu, 05 Jun 2025 18:00:37 GMT
Title: Learning-Augmented Algorithms for MTS with Bandit Access to Multiple Predictors
Authors: Matei Gabriel Coşa, Marek Eliáš,
Abstract summary: We show how to achieve regret of $O(text2/3)$ and prove a tight lower bound based on the construction of Dekel et al.<n>This is related to Learning against memory bounded adversaries.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We consider the following problem: We are given $\ell$ heuristics for Metrical Task Systems (MTS), where each might be tailored to a different type of input instances. While processing an input instance received online, we are allowed to query the action of only one of the heuristics at each time step. Our goal is to achieve performance comparable to the best of the given heuristics. The main difficulty of our setting comes from the fact that the cost paid by a heuristic at time $t$ cannot be estimated unless the same heuristic was also queried at time $t-1$. This is related to Bandit Learning against memory bounded adversaries (Arora et al., 2012). We show how to achieve regret of $O(\text{OPT}^{2/3})$ and prove a tight lower bound based on the construction of Dekel et al. (2013).

Related papers

From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations.<n>We develop novel last-iterate upper bounds in the realizable least squares setup.<n>We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z)
A New Benchmark for Online Learning with Budget-Balancing Constraints [14.818946657685267]
We present a new benchmark to compare against, motivated both by real-world applications such as autobidding and by its underlying mathematical structure.<n>We show that sublinear regret is attainable against any strategy whose spending pattern is within $o(T2)$ of any sub-linear spending pattern.
arXiv Detail & Related papers (2025-03-19T00:14:20Z)
Efficient Reinforcement Learning in Probabilistic Reward Machines [15.645062155050732]
We design an algorithm for Probabilistic Reward Machines (PRMs) that achieves a regret bound of $widetildeO(sqrtHOAT)$. We also present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward.
arXiv Detail & Related papers (2024-08-19T19:51:53Z)
Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. In specific, each batch collects data according to a policy that depends on previous batches and the rewards are revealed only at the end of the batch. Our algorithm achieves regret bounds comparable to those in fully sequential setting with only $mathcalO( log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z)
Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting. In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to generalize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z)
Tractable Optimality in Episodic Latent MABs [75.17357040707347]
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps. Depending on the length of the episode, the learner may not be able to estimate accurately the latent context. We design a procedure that provably learns a near-optimal policy with $O(textttpoly(A) + texttttpoly(M,H)min(M,H))$ interactions.
arXiv Detail & Related papers (2022-10-05T22:53:46Z)
Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k(ll d)$ dimensional linear representation. We come up with novel algorithms that achieve $widetildeOleft(dsqrtkMT + kMsqrtTright)$ regret bounds, which matches the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z)
No Regrets for Learning the Prior in Bandits [30.478188057004175]
$tt AdaTS$ is a Thompson sampling algorithm that adapts sequentially to bandit tasks. $tt AdaTS$ is a fully-Bayesian algorithm that can be implemented efficiently in several classes of bandit problems.
arXiv Detail & Related papers (2021-07-13T15:51:32Z)
Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function. The goal is to find a policy that matches the expert's performance on some predefined set of cost functions. We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z)
Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB)
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition [38.28925339231888]
We develop the first algorithm with a best-of-both-worlds'' guarantee. It achieves $mathcalO(log T)$ regret when the losses are adversarial. More generally, it achieves $tildemathcalO(sqrtC)$ regret in an intermediate setting.
arXiv Detail & Related papers (2020-06-10T01:59:34Z)
Improved Sleeping Bandits with Stochastic Actions Sets and Adversarial Rewards [59.559028338399855]
We consider the problem of sleeping bandits with action sets and adversarial rewards. In this paper, we provide a new computationally efficient algorithm inspired by EXP3 satisfying a regret of order $O(sqrtT)$.
arXiv Detail & Related papers (2020-04-14T00:41:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.