Scale-free Adversarial Reinforcement Learning
- URL: http://arxiv.org/abs/2403.00930v1
- Date: Fri, 1 Mar 2024 19:21:10 GMT
- Title: Scale-free Adversarial Reinforcement Learning
- Authors: Mingyu Chen, Xuezhou Zhang
- Abstract summary: This paper initiates the study of scale-free learning in Markov Decision Processes (MDPs).
We design a generic algorithmic framework, \underline{S}cale \underline{C}lipping \underline{B}ound (\texttt{SCB}).
We achieve the first minimax optimal expected regret bound and the first high-probability regret bound in scale-free adversarial MABs.
- Score: 17.276918882127728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper initiates the study of scale-free learning in Markov Decision
Processes (MDPs), where the scale of rewards/losses is unknown to the learner.
We design a generic algorithmic framework, \underline{S}cale
\underline{C}lipping \underline{B}ound (\texttt{SCB}), and instantiate this
framework in both the adversarial Multi-armed Bandit (MAB) setting and the
adversarial MDP setting. Through this framework, we achieve the first minimax
optimal expected regret bound and the first high-probability regret bound in
scale-free adversarial MABs, resolving an open problem raised in
\cite{hadiji2023adaptation}. On adversarial MDPs, our framework also gives birth
to the first scale-free RL algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$
high-probability regret guarantee.
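The abstract does not spell out SCB's exact update rule, but the name suggests clipping observed losses to an adaptively learned scale bound before feeding them to a standard bandit learner. Below is a minimal, self-contained Python sketch of that general clip-then-estimate idea on top of an Exp3-style learner; the function name, the max-based running bound, and the fixed learning rate eta are all illustrative assumptions, not the paper's actual SCB instantiation (which tunes these quantities carefully to obtain minimax-optimal regret).

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped_exp3_sketch(loss_matrix, eta=0.1):
    """Exp3-style learner with an adaptive clipping bound.

    A sketch of the clip-then-estimate idea behind scale-free
    bandit methods: losses of unknown magnitude are clipped to a
    bound learned from past observations before building
    importance-weighted estimates. The fixed learning rate and
    max-based bound are illustrative simplifications only.
    """
    T, K = loss_matrix.shape
    log_weights = np.zeros(K)   # log-weights over the K arms
    bound = 1e-9                # running estimate of the loss scale
    total_loss = 0.0
    for t in range(T):
        probs = np.exp(log_weights - log_weights.max())
        probs /= probs.sum()
        arm = rng.choice(K, p=probs)
        loss = loss_matrix[t, arm]
        total_loss += loss
        # clip with the scale estimated from past rounds only
        clipped = np.clip(loss, -bound, bound)
        bound = max(bound, abs(loss))
        # importance-weighted estimate of the clipped loss
        estimate = np.zeros(K)
        estimate[arm] = clipped / probs[arm]
        log_weights -= eta * estimate
    return total_loss

# adversary picks losses on an arbitrary, unknown scale
losses = 50.0 * rng.random((1000, 5))
losses[:, 2] -= 10.0            # arm 2 is best on average
print(clipped_exp3_sketch(losses))
```

Clipping with the bound from past rounds keeps the importance-weighted estimator's range controlled without knowing the true loss scale upfront, at the cost of a bias in the early rounds where the bound is still too small; handling that bias is the kind of issue a full analysis has to account for.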
Related papers
- Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs [56.246783503873225]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a simpler weight-based algorithm that is as efficient as window/restart-based algorithms.
Our framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2026-01-03T04:50:21Z) - Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback [61.49239204705301]
We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model.
Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems.
arXiv Detail & Related papers (2025-10-20T02:28:08Z) - Best-of-Both Worlds for linear contextual bandits with paid observations [16.13456643813766]
We introduce a computationally efficient Best-of-Both-Worlds (BOBW) algorithm for this problem.
We show that it achieves the minimax-optimal regret of $\Theta(T^{2/3})$ in adversarial settings, while guaranteeing poly-logarithmic regret in (corrupted) stochastic regimes.
arXiv Detail & Related papers (2025-10-08T18:23:37Z) - Q-Learning with Fine-Grained Gap-Dependent Regret [13.370933509246568]
Existing model-free algorithms achieve minimax worst-case regret, but their gap-dependent bounds remain coarse and fail to fully capture the structure of suboptimality gaps.
We establish fine-grained gap-dependent regret bounds for both UCB-based and non-UCB-based algorithms.
arXiv Detail & Related papers (2025-10-08T05:02:16Z) - Experimental Design for Semiparametric Bandits [11.156009461711639]
We study finite-armed semiparametric bandits, where each arm's reward combines a linear component with an unknown, potentially adversarial shift.
We propose the first experimental-design approach that simultaneously offers a sharp regret bound, a PAC bound, and a best-arm identification guarantee.
arXiv Detail & Related papers (2025-06-16T11:53:00Z) - uniINF: Best-of-Both-Worlds Algorithm for Parameter-Free Heavy-Tailed MABs [33.262918224598614]
We present a novel algorithm for the Heavy-Tailed Multi-Armed Bandits (HTMAB) problem, demonstrating robustness and adaptability.
Our novel algorithm uniINF enjoys the so-called Best-of-Both-Worlds (BoBW) property, performing optimally in both stochastic and adversarial environments.
To our knowledge, uniINF is the first parameter-free algorithm to achieve the BoBW property for the heavy-tailed MAB problem.
arXiv Detail & Related papers (2024-10-04T09:55:44Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Meta-Learning Adversarial Bandit Algorithms [55.72892209124227]
We study online meta-learning with bandit feedback.
We learn to tune a generalization of online mirror descent (OMD) with self-concordant barrier regularizers.
arXiv Detail & Related papers (2023-07-05T13:52:10Z) - Meta-Learning Adversarial Bandits [49.094361442409785]
We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure.
As the first to target the adversarial setting, we design a meta-algorithm with setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO).
Our guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a sequence of non-smooth and non-convex functions.
arXiv Detail & Related papers (2022-05-27T17:40:32Z) - On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm design and performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z) - Online Learning with Knapsacks: the Best of Both Worlds [54.28273783164608]
We cast online learning problems in which a decision maker wants to maximize their expected reward without violating a finite set of $m$ resource constraints.
Our framework allows the decision maker to flexibly handle general reward and cost functions.
arXiv Detail & Related papers (2022-02-28T12:10:48Z) - First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new \emph{upper confidence primal-dual} algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)