Tail Distribution of Regret in Optimistic Reinforcement Learning
- URL: http://arxiv.org/abs/2511.18247v1
- Date: Sun, 23 Nov 2025 02:23:09 GMT
- Title: Tail Distribution of Regret in Optimistic Reinforcement Learning
- Authors: Sajad Khodadadian, Mehrdad Moharrami
- Abstract summary: We characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. The proposed algorithm depends on a tuning parameter $\alpha$, which balances the expected regret and the range over which the regret exhibits a sub-Gaussian tail.
- Score: 3.9962751777898955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on $\Pr(R_K \ge x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale $m_K$ up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[R_K]$. The proposed algorithm depends on a tuning parameter $\alpha$, which balances the expected regret and the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.
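The abstract describes the two-regime shape without stating the explicit bound. As a purely schematic rendering (the constants $c_1, c_2 > 0$, variance proxy $v_K$, transition threshold $x_K^{\mathrm{tr}}$, and sub-Weibull exponent $\theta \in (0,1)$ are hypothetical placeholders, not the paper's actual quantities), the structure reads:

```latex
\Pr(R_K \ge x) \;\lesssim\;
\begin{cases}
\exp\!\big(-c_1 \,(x - m_K)^2 / v_K\big), & m_K \le x \le x_K^{\mathrm{tr}} \quad \text{(sub-Gaussian regime)},\\[2pt]
\exp\!\big(-c_2 \, x^{\theta}\big), & x > x_K^{\mathrm{tr}} \quad \text{(sub-Weibull regime)}.
\end{cases}
```

The tuning parameter $\alpha$ then trades off $\mathbb{E}[R_K]$ against how far the sub-Gaussian regime extends.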
Related papers
- Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions. We show that for any prediction risk, the optimally mixed student improves upon the ridge teacher for every regularization level. We propose a consistent one-shot tuning method to estimate the optimal mixing parameter without grid search, sample splitting, or refitting.
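As a concrete illustration of the procedure summarized above, here is a minimal sketch of one self-distillation round in ridge regression (the mixing weight `w` and penalty `lam` are hypothetical names; the paper's one-shot tuning rule for the optimal weight is not reproduced here):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def self_distill(X, y, lam, w):
    """One self-distillation round: refit a student on a convex
    mixture of ground-truth labels and the teacher's predictions."""
    beta_teacher = ridge_fit(X, y, lam)       # teacher: plain ridge fit
    y_teacher = X @ beta_teacher              # teacher's own predictions
    y_mixed = (1 - w) * y + w * y_teacher     # mixed training targets
    return ridge_fit(X, y_mixed, lam)         # student fit on mixed targets
```

Per the summary, for a suitably chosen mixture the student strictly improves on the ridge teacher at every regularization level.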
arXiv Detail & Related papers (2026-02-19T17:21:15Z) - Achieving Constant Regret in Linear Markov Decision Processes [57.34287648914407]
We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs). We show that Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3 H^5 / \Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d} H^2))$.
arXiv Detail & Related papers (2024-04-16T17:23:19Z) - $(\epsilon, u)$-Adaptive Regret Minimization in Heavy-Tailed Bandits [29.966828248335972]
We study the regret minimization problem when $\epsilon$ and $u$ are unknown to the learner.
AdaR-UCB is the first algorithm that enjoys regret guarantees nearly matching those of the non-adaptive heavy-tailed case.
arXiv Detail & Related papers (2023-10-04T17:11:15Z) - Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk [26.397343668067382]
We study the optimal trade-off between expectation and tail risk for regret distribution in the multi-armed bandit model. New policies are proposed to characterize the optimal regret tail probability for any regret threshold.
arXiv Detail & Related papers (2023-04-10T01:00:18Z) - Variance-aware robust reinforcement learning with linear function approximation with heavy-tailed rewards [6.932056534450556]
We present two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards.
AdaOFUL achieves a state-of-the-art variance-aware regret bound.
VARA achieves a tighter variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^* K})$.
arXiv Detail & Related papers (2023-03-09T22:16:28Z) - First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_\star \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
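As a generic sketch of the ingredient named above (empirical costs perturbed downward by a count-based exploration bonus; the transition-skewing step and EB-SSP's exact bonus are more involved and not reproduced here, and all names are hypothetical):

```python
import numpy as np

def optimistic_empirical_model(counts, cost_sums, c_bonus=1.0):
    """Build an optimistic empirical SSP model.

    counts[s, a, s2]  -- visit counts of transition (s, a) -> s2
    cost_sums[s, a]   -- sum of observed costs at (s, a)
    """
    n_sa = np.maximum(counts.sum(axis=2), 1)              # visits to (s, a)
    p_hat = counts / n_sa[:, :, None]                     # empirical transitions
    c_hat = cost_sums / n_sa                              # empirical mean costs
    bonus = c_bonus * np.sqrt(np.log(1.0 + n_sa) / n_sa)  # exploration bonus
    c_opt = np.clip(c_hat - bonus, 0.0, None)             # optimistic costs
    return p_hat, c_opt
```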
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
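Schematically, the statement has the following shape (the margin-type exponent $\gamma \ge 0$ is a hypothetical placeholder for the paper's exact conditions): if the $Q^*$-estimate converges pointwise at rate $n^{-\kappa}$, the greedy policy's regret converges at that rate raised to a larger power,

```latex
\big|\widehat{Q}(s,a) - Q^*(s,a)\big| = O_P\!\left(n^{-\kappa}\right)
\quad \Longrightarrow \quad
\operatorname{Regret}\!\big(\pi_{\widehat{Q}}\big) = O_P\!\left(n^{-\kappa(1+\gamma)}\right).
```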
arXiv Detail & Related papers (2021-01-31T16:17:56Z) - Nonstationary Reinforcement Learning with Linear Function Approximation [19.521419943509784]
We consider reinforcement learning in episodic Markov decision processes (MDPs) with linear function approximation under a drifting environment.
We first develop an optimistic modification of least-squares value iteration with periodic restart, and bound its dynamic regret when variation budgets are known.
We derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and, as a byproduct, establish a minimax regret lower bound for linear MDPs that was left open by Jin et al.
arXiv Detail & Related papers (2020-10-08T20:07:44Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
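Schematically, the Lagrange-multiplier process in such primal-dual methods evolves by projected dual ascent (a generic sketch with hypothetical names, not the paper's exact update):

```latex
\lambda_{t+1} \;=\; \big[\lambda_t + \eta\,\big(\widehat{C}_t(\pi_t) - \bar{c}\big)\big]_{+},
```

where $\widehat{C}_t(\pi_t)$ is the estimated constraint cost of the current policy, $\bar{c}$ is the constraint budget, and $\eta$ is a step size; the drift analysis controls how far $\lambda_t$ can wander with high probability.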
arXiv Detail & Related papers (2020-03-02T05:02:23Z)