Non-stochastic Bandits With Evolving Observations
- URL: http://arxiv.org/abs/2405.16843v1
- Date: Mon, 27 May 2024 05:32:46 GMT
- Title: Non-stochastic Bandits With Evolving Observations
- Authors: Yogev Bar-On, Yishay Mansour
- Abstract summary: We introduce a novel online learning framework that unifies and generalizes pre-established models.
We propose regret minimization algorithms for both the full-information and bandit settings.
Our algorithms match the known regret bounds across many special cases, while also introducing previously unknown bounds.
- Score: 47.61533665679308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel online learning framework that unifies and generalizes pre-established models, such as delayed and corrupted feedback, to encompass adversarial environments where action feedback evolves over time. In this setting, the observed loss is arbitrary and may not correlate with the true loss incurred, with each round updating previous observations adversarially. We propose regret minimization algorithms for both the full-information and bandit settings, with regret bounds quantified by the average feedback accuracy relative to the true loss. Our algorithms match the known regret bounds across many special cases, while also introducing previously unknown bounds.
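The abstract describes regret minimization under bandit feedback but gives no pseudocode. As a rough, illustrative sketch of the classical EXP3 algorithm for non-stochastic bandits (a standard baseline in this literature, not the paper's algorithm), where the loss table, arm count, and learning rate `eta` are hypothetical inputs:

```python
import math
import random

def exp3(n_arms, losses, eta):
    """Classical EXP3: exponential weights with importance-weighted
    loss estimates under bandit feedback. losses[t][i] is the loss of
    arm i at round t; only the pulled arm's loss is observed."""
    weights = [1.0] * n_arms
    total_loss = 0.0
    for round_losses in losses:
        s = sum(weights)
        probs = [w / s for w in weights]
        # Sample an arm from the current mixed strategy.
        arm = random.choices(range(n_arms), weights=probs)[0]
        observed = round_losses[arm]
        total_loss += observed
        # Importance-weighted estimate: unbiased for the true loss.
        estimate = observed / probs[arm]
        weights[arm] *= math.exp(-eta * estimate)
    return total_loss
```

Run on a toy instance where arm 0 always incurs loss 0 and arm 1 loss 1, the weights concentrate on arm 0 and the cumulative loss stays well below the 200 rounds played. The paper's setting is harder: the observed loss may not correlate with the true loss, so this unbiasedness argument no longer applies directly.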
Related papers
- Accelerated Rates between Stochastic and Adversarial Online Convex Optimization [2.628557920905129]
We establish novel regret bounds for online convex optimization in a setting that interpolates between i.i.d. and fully adversarial losses.
In the fully i.i.d. case, our regret bounds match the rates one would expect from results in acceleration, and we also recover the optimally accelerated rates via online-to-batch conversion.
arXiv Detail & Related papers (2023-03-06T16:41:57Z) - Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
arXiv Detail & Related papers (2022-10-13T04:26:06Z) - Regret Minimization and Convergence to Equilibria in General-sum Markov Games [57.568118148036376]
We present the first algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents.
Our algorithm is decentralized, computationally efficient, and does not require any communication between agents.
arXiv Detail & Related papers (2022-07-28T16:27:59Z) - Second Order Regret Bounds Against Generalized Expert Sequences under Partial Bandit Feedback [0.0]
We study the problem of expert advice under partial bandit feedback setting and create a sequential minimax optimal algorithm.
Our algorithm works with a more general partial monitoring setting, where, in contrast to the classical bandit feedback, the losses can be revealed in an adversarial manner.
arXiv Detail & Related papers (2022-04-13T22:48:12Z) - Nonstochastic Bandits and Experts with Arm-Dependent Delays [17.272515865592542]
We study nonstochastic bandits and experts in a delayed setting where delays depend on both time and arms.
Our analyses hinge on a novel bound on the drift, measuring how much better an algorithm can perform when given a look-ahead of one round.
arXiv Detail & Related papers (2021-11-02T13:36:11Z) - A Regret Minimization Approach to Iterative Learning Control [61.37088759497583]
We propose a new performance metric, planning regret, which replaces the standard uncertainty assumptions with worst case regret.
We provide theoretical and empirical evidence that the proposed algorithm outperforms existing methods on several benchmarks.
arXiv Detail & Related papers (2021-02-26T13:48:49Z) - Online non-convex optimization with imperfect feedback [33.80530308979131]
We consider the problem of online learning with non-convex losses.
In terms of feedback, we assume that the learner observes - or otherwise constructs - an inexact model for the loss function at each stage.
We propose a mixed-strategy learning policy based on dual averaging.
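The blurb names dual averaging but gives no detail. As an illustrative sketch only (not the paper's exact policy), here is dual averaging over the probability simplex with an entropic regularizer: the mixed strategy is the softmax of the negated cumulative loss estimates, and the estimates fed in may be inexact, matching the imperfect-feedback theme; the input list and step size `eta` are hypothetical:

```python
import math

def dual_averaging_simplex(loss_estimates, eta):
    """Dual averaging over the simplex: play the softmax of the
    negated cumulative (possibly inexact) loss estimates.
    Returns the final mixed strategy."""
    n = len(loss_estimates[0])
    cumulative = [0.0] * n
    strategy = [1.0 / n] * n  # start uniform
    for estimate in loss_estimates:
        cumulative = [c + g for c, g in zip(cumulative, estimate)]
        scores = [math.exp(-eta * c) for c in cumulative]
        z = sum(scores)
        strategy = [s / z for s in scores]
    return strategy
```

For example, after ten rounds in which action 0 is estimated to incur loss 1 and action 1 loss 0, the returned strategy places almost all mass on action 1.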
arXiv Detail & Related papers (2020-10-16T16:53:13Z) - Temporal Variability in Implicit Online Learning [15.974402990630402]
The tightest regret analyses only show marginal improvements over Online Mirror Descent.
We prove a novel static regret bound that depends on the temporal variability of the sequence of loss functions.
We present an adaptive algorithm that achieves this regret bound without prior knowledge of the temporal variability.
arXiv Detail & Related papers (2020-06-12T22:50:34Z) - Prediction with Corrupted Expert Advice [67.67399390910381]
We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in a benign environment.
Our results reveal a surprising disparity between the often comparable Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) frameworks.
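The blurb above refers to Multiplicative Weights with decreasing step sizes. As a minimal sketch of that classical full-information update, assuming an illustrative schedule eta_t = 1/sqrt(t) (the actual schedule analyzed in the paper may differ):

```python
import math

def multiplicative_weights(loss_rounds, n_experts):
    """Multiplicative Weights (Hedge) with decreasing step size
    eta_t = 1/sqrt(t). loss_rounds[t][i] is the full-information
    loss of expert i at round t. Returns the cumulative expected loss."""
    weights = [1.0] * n_experts
    cumulative = 0.0
    for t, losses in enumerate(loss_rounds, start=1):
        eta = 1.0 / math.sqrt(t)
        s = sum(weights)
        probs = [w / s for w in weights]
        # Expected loss of the current mixed strategy.
        cumulative += sum(p * l for p, l in zip(probs, losses))
        # Exponential down-weighting of lossy experts.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return cumulative
```

In a benign environment where one expert always incurs loss 0 and the other loss 1, the expected loss of the first round is 0.5 and the per-round loss then decays rapidly, so the cumulative loss over 100 rounds stays bounded by a small constant, consistent with the constant-regret phenomenon the paper studies.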
arXiv Detail & Related papers (2020-02-24T14:39:55Z) - Corruption-robust exploration in episodic reinforcement learning [76.19192549843727]
We study multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
Our framework yields efficient algorithms which attain near-optimal regret in the absence of corruptions.
Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
arXiv Detail & Related papers (2019-11-20T03:49:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.