Scale-free Adversarial Reinforcement Learning
- URL: http://arxiv.org/abs/2403.00930v1
- Date: Fri, 1 Mar 2024 19:21:10 GMT
- Title: Scale-free Adversarial Reinforcement Learning
- Authors: Mingyu Chen, Xuezhou Zhang
- Abstract summary: This paper initiates the study of scale-free learning in Markov Decision Processes (MDPs).
We design a generic algorithmic framework, \underline{S}cale \underline{C}lipping \underline{B}ound (\texttt{SCB}).
We achieve the first minimax optimal expected regret bound and the first high-probability regret bound in scale-free adversarial MABs.
- Score: 17.276918882127728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper initiates the study of scale-free learning in Markov Decision
Processes (MDPs), where the scale of rewards/losses is unknown to the learner.
We design a generic algorithmic framework, \underline{S}cale
\underline{C}lipping \underline{B}ound (\texttt{SCB}), and instantiate this
framework in both the adversarial Multi-armed Bandit (MAB) setting and the
adversarial MDP setting. Through this framework, we achieve the first minimax
optimal expected regret bound and the first high-probability regret bound in
scale-free adversarial MABs, resolving an open problem raised in
\cite{hadiji2023adaptation}. On adversarial MDPs, our framework also gives birth
to the first scale-free RL algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$
high-probability regret guarantee.
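The abstract does not spell out SCB's exact update rule, but the name suggests clipping observed losses to an adaptively learned scale bound before feeding them to a standard bandit learner. Below is a minimal, self-contained Python sketch of that general clip-then-estimate idea on top of an Exp3-style learner; the function name, the max-based running bound, and the fixed learning rate eta are all illustrative assumptions, not the paper's actual SCB instantiation (which tunes these quantities carefully to obtain minimax-optimal regret).

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped_exp3_sketch(loss_matrix, eta=0.1):
    """Exp3-style learner with an adaptive clipping bound.

    A sketch of the clip-then-estimate idea behind scale-free
    bandit methods: losses of unknown magnitude are clipped to a
    bound learned from past observations before building
    importance-weighted estimates. The fixed learning rate and
    max-based bound are illustrative simplifications only.
    """
    T, K = loss_matrix.shape
    log_weights = np.zeros(K)   # log-weights over the K arms
    bound = 1e-9                # running estimate of the loss scale
    total_loss = 0.0
    for t in range(T):
        probs = np.exp(log_weights - log_weights.max())
        probs /= probs.sum()
        arm = rng.choice(K, p=probs)
        loss = loss_matrix[t, arm]
        total_loss += loss
        # clip with the scale estimated from past rounds only
        clipped = np.clip(loss, -bound, bound)
        bound = max(bound, abs(loss))
        # importance-weighted estimate of the clipped loss
        estimate = np.zeros(K)
        estimate[arm] = clipped / probs[arm]
        log_weights -= eta * estimate
    return total_loss

# adversary picks losses on an arbitrary, unknown scale
losses = 50.0 * rng.random((1000, 5))
losses[:, 2] -= 10.0            # arm 2 is best on average
print(clipped_exp3_sketch(losses))
```

Clipping with the bound from past rounds keeps the importance-weighted estimator's range controlled without knowing the true loss scale upfront, at the cost of a bias in the early rounds where the bound is still too small; handling that bias is the kind of issue a full analysis has to account for.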
Related papers
- Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs [56.246783503873225]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a simpler weight-based algorithm that is as efficient as window/restart-based algorithms.
Our framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2026-01-03T04:50:21Z) - Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback [61.49239204705301]
We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model.
Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems.
arXiv Detail & Related papers (2025-10-20T02:28:08Z) - Best-of-Both Worlds for linear contextual bandits with paid observations [16.13456643813766]
We introduce a computationally efficient Best-of-Both-Worlds (BOBW) algorithm for this problem.
We show that it achieves the minimax-optimal regret of $\Theta(T^{2/3})$ in adversarial settings, while guaranteeing poly-logarithmic regret in (corrupted) stochastic regimes.
arXiv Detail & Related papers (2025-10-08T18:23:37Z) - Q-Learning with Fine-Grained Gap-Dependent Regret [13.370933509246568]
Existing model-free algorithms achieve minimax worst-case regret, but their gap-dependent bounds remain coarse and fail to fully capture the structure of suboptimality gaps.
We establish fine-grained gap-dependent regret bounds for both UCB-based and non-UCB-based algorithms.
arXiv Detail & Related papers (2025-10-08T05:02:16Z) - Experimental Design for Semiparametric Bandits [11.156009461711639]
We study finite-armed semiparametric bandits, where each arm's reward combines a linear component with an unknown, potentially adversarial shift.
We propose the first experimental-design approach that simultaneously offers a sharp regret bound, a PAC bound, and a best-arm identification guarantee.
arXiv Detail & Related papers (2025-06-16T11:53:00Z) - uniINF: Best-of-Both-Worlds Algorithm for Parameter-Free Heavy-Tailed MABs [33.262918224598614]
We present a novel algorithm for the Heavy-Tailed Multi-Armed Bandits (HTMAB) problem, demonstrating robustness and adaptability.
Our novel algorithm uniINF enjoys the so-called Best-of-Both-Worlds (BoBW) property, performing optimally in both stochastic and adversarial environments.
To our knowledge, uniINF is the first parameter-free algorithm to achieve the BoBW property for the heavy-tailed MAB problem.
arXiv Detail & Related papers (2024-10-04T09:55:44Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Meta-Learning Adversarial Bandit Algorithms [55.72892209124227]
We study online meta-learning with bandit feedback.
We learn to tune a generalization of online mirror descent (OMD) with self-concordant barrier regularizers.
arXiv Detail & Related papers (2023-07-05T13:52:10Z) - Meta-Learning Adversarial Bandits [49.094361442409785]
We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure.
As the first to target the adversarial setting, we design a meta-algorithm with setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO).
Our guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a sequence of non-smooth and non-convex functions.
arXiv Detail & Related papers (2022-05-27T17:40:32Z) - On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm design and performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z) - Online Learning with Knapsacks: the Best of Both Worlds [54.28273783164608]
We cast online learning problems in which a decision maker wants to maximize their expected reward without violating a finite set of $m$ resource constraints.
Our framework allows the decision maker to flexibly handle general reward and cost functions.
arXiv Detail & Related papers (2022-02-28T12:10:48Z) - First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new \emph{upper confidence primal-dual} algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)