Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization
- URL: http://arxiv.org/abs/2602.08000v1
- Date: Sun, 08 Feb 2026 14:54:02 GMT
- Title: Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization
- Authors: Anirudh Satheesh, Vaneet Aggarwal
- Abstract summary: We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. We propose a primal--dual natural actor--critic algorithm that leverages multi-level Monte Carlo estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles.
- Score: 47.72469270565647
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal--dual natural actor--critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.
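The abstract describes the estimator only at a high level. As one way to read the MLMC-plus-burn-in idea, here is a minimal Python sketch; the callback `sample_grad`, the Geometric(1/2) level distribution, and all parameter names are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def mlmc_gradient_estimate(sample_grad, max_level=10, burn_in=100, rng=None):
    """Illustrative MLMC gradient estimator with an explicit burn-in.

    `sample_grad(k)` is a hypothetical callback that rolls the current
    policy forward and returns an array of shape (k, d) holding k
    per-step gradient samples from one trajectory; it is not an
    interface defined by the paper.
    """
    rng = rng or np.random.default_rng()

    # Burn-in: discard an initial trajectory segment so transient states
    # of the unichain stop influencing the samples, without assuming
    # knowledge of the mixing time.
    _ = sample_grad(burn_in)

    # Random truncation level J ~ Geometric(1/2), capped at max_level.
    j = min(int(rng.geometric(0.5)), max_level)

    grads = sample_grad(2 ** j)  # 2^J post-burn-in gradient samples
    mean_first = lambda k: grads[:k].mean(axis=0)

    # Single-sample MLMC telescoping: in expectation this matches the
    # long-horizon averaged gradient up to the bias from truncating at
    # max_level, while each call pays only for one length-2^J trajectory.
    estimate = mean_first(1)
    if j >= 1:
        # P(J = j) for the untruncated geometric; the cap at max_level
        # introduces the usual small truncation bias.
        prob_j = 2.0 ** (-j)
        estimate = estimate + (mean_first(2 ** j) - mean_first(2 ** (j - 1))) / prob_j
    return estimate
```

A toy callback such as `lambda k: np.random.standard_normal((k, 4))` is enough to exercise the function; in the algorithm proper it would be backed by critic-based advantage estimates.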
Related papers
- Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum [62.691095807959215]
We establish an optimal sample complexity of $O(\epsilon^{-2})$ for obtaining an $\epsilon$-optimal global policy using a single-timescale actor-critic (AC) algorithm. The momentum mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability (a generic form of the momentum update is sketched after this list).
arXiv Detail & Related papers (2026-02-02T00:35:42Z) - Regret Analysis of Average-Reward Unichain MDPs via an Actor-Critic Approach [46.80389197344682]
We propose a Natural Actor-Critic algorithm with order-optimal regret of $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward Markov Decision Processes. NACB employs function approximation for both the actor and the critic, enabling scalability to large state and action spaces while accommodating potential periodicity.
arXiv Detail & Related papers (2025-05-26T13:43:02Z) - Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning [50.81240969750462]
We present the first finite-sample analysis of policy evaluation in robust average-reward Markov Decision Processes (MDPs). We show that the robust Bellman operator is a contraction under a carefully constructed semi-norm and develop a framework with controlled bias (a generic form of this contraction is sketched after this list). Our method achieves the order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for robust policy evaluation and robust average-reward estimation.
arXiv Detail & Related papers (2025-02-24T03:55:09Z) - Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs [16.49229317664822]
We study the problem of infinite-horizon average-reward reinforcement learning with linear Markov decision processes (MDPs). Our approach approximates the average-reward setting by a discounted MDP with a discount factor close to one, then applies optimistic value iteration (the classical reduction behind this is recalled after this list).
arXiv Detail & Related papers (2024-05-23T20:58:33Z) - Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability [48.62272919754204]
We study generalization in Bayesian RL under the probably approximately correct (PAC) framework.
Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense.
arXiv Detail & Related papers (2021-09-24T07:48:34Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems. However, a solid theoretical understanding of even the "vanilla" PG has long remained elusive.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning (the generic projected dual update behind such multiplier processes is recalled after this list).
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
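On the momentum mechanism in "Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum": single-timescale analyses of this kind typically study a momentum-averaged gradient update of the generic form below. The symbols $\beta$, $\alpha$, and $\hat g_t$ are placeholder notation, not that paper's.

```latex
% Generic momentum-averaged actor update (placeholder notation):
% \hat g_t is the stochastic policy-gradient estimate at step t.
\[
  m_{t+1} = \beta\, m_t + (1 - \beta)\, \hat g_t ,
  \qquad
  \theta_{t+1} = \theta_t + \alpha\, m_{t+1} .
\]
```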
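On the contraction result in "Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning": the claim has the generic shape below. The paper's carefully constructed semi-norm is not reproduced here; the span semi-norm is shown only as a familiar stand-in.

```latex
% Generic shape of the contraction claim; the span semi-norm is an
% illustrative stand-in, not necessarily the semi-norm used in the paper.
\[
  \lVert V \rVert_{\mathrm{sp}} = \max_{s} V(s) - \min_{s} V(s),
  \qquad
  \lVert \mathcal{T}_{\mathrm{rob}} V_1 - \mathcal{T}_{\mathrm{rob}} V_2 \rVert_{\mathrm{sp}}
  \le \beta \,\lVert V_1 - V_2 \rVert_{\mathrm{sp}} ,
  \quad \beta < 1 .
\]
```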
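On the discounted approximation in "Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs": the classical reduction states that, as the discount factor $\gamma \to 1$, the scaled discounted value approaches the optimal gain. A standard form of the bound, quoted from the general average-reward literature rather than from that paper, is:

```latex
% Classical discounted-to-average reduction: \rho^* is the optimal gain,
% V^*_\gamma the optimal discounted value, and \mathrm{sp}(h^*) the span
% of the optimal bias function.
\[
  \bigl|\, \rho^* - (1 - \gamma)\, V^*_\gamma(s) \,\bigr|
  \le (1 - \gamma)\, \mathrm{sp}(h^*)
  \qquad \text{for every state } s .
\]
```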
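On the Lagrange-multiplier drift analysis in "Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss": such analyses track a projected dual-ascent process of the generic form below, where $\eta$ is a step size, $\hat c_t$ an estimated constraint cost, and $b$ the constraint threshold (all placeholder notation).

```latex
% Generic projected dual-ascent step for a primal-dual CMDP method;
% [x]_+ = max(x, 0) keeps the multiplier nonnegative.
\[
  \lambda_{t+1} = \bigl[\, \lambda_t + \eta\, (\hat c_t - b) \,\bigr]_{+} .
\]
```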