Provably Efficient Sample Complexity for Robust CMDP
- URL: http://arxiv.org/abs/2511.07486v1
- Date: Wed, 12 Nov 2025 01:01:26 GMT
- Title: Provably Efficient Sample Complexity for Robust CMDP
- Authors: Sourav Ganguly, Arnob Ghosh
- Abstract summary: We study the problem of learning policies that maximize cumulative reward while satisfying safety constraints. We focus on robust constrained Markov decision processes (RCMDPs), where the agent must maximize reward while ensuring cumulative utility exceeds a threshold. We propose a novel Robust constrained Value iteration (RCVI) algorithm with a sample complexity of $\tilde{\mathcal{O}}(|S||A|H^5/\epsilon^2)$, achieving at most $\epsilon$ violation.
- Score: 7.060086147428817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of learning policies that maximize cumulative reward while satisfying safety constraints, even when the real environment differs from a simulator or nominal model. We focus on robust constrained Markov decision processes (RCMDPs), where the agent must maximize reward while ensuring cumulative utility exceeds a threshold under the worst-case dynamics within an uncertainty set. While recent works have established finite-time iteration complexity guarantees for RCMDPs using policy optimization, their sample complexity guarantees remain largely unexplored. In this paper, we first show that Markovian policies may fail to be optimal even under rectangular uncertainty sets, unlike the {\em unconstrained} robust MDP. To address this, we introduce an augmented state space that incorporates the remaining utility budget into the state representation. Building on this formulation, we propose a novel Robust constrained Value iteration (RCVI) algorithm with a sample complexity of $\tilde{\mathcal{O}}(|S||A|H^5/\epsilon^2)$, achieving at most $\epsilon$ violation using a generative model, where $|S|$ and $|A|$ denote the sizes of the state and action spaces, respectively, and $H$ is the episode length. To the best of our knowledge, this is the {\em first sample complexity guarantee} for RCMDP. Empirical results further validate the effectiveness of our approach.
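To make the augmented-state idea concrete, here is a minimal value-iteration sketch in which the state is the pair $(s, b)$, with $b$ the remaining utility budget. It is an illustration, not the paper's RCVI: it assumes the uncertainty set is given as a finite list of candidate transition kernels (rather than being estimated from a generative model), it discretizes the budget on a crude grid, and the function name `rcvi_sketch` and all of its arguments are ours, not the paper's.

```python
import itertools
import numpy as np

def rcvi_sketch(S, A, H, r, u, kernels, budget, n_grid=51):
    """Robust constrained value iteration over the augmented state (s, b).

    S, A, H : number of states, number of actions, horizon
    r, u    : |S| x |A| arrays of rewards and utilities
    kernels : finite list of |S| x |A| x |S| transition matrices,
              standing in for the (s, a)-rectangular uncertainty set
    budget  : utility threshold the policy must accumulate
    """
    grid = np.linspace(0.0, budget, n_grid)   # remaining-budget levels
    V = np.zeros((H + 1, S, n_grid))
    V[H][:, grid > 0] = -1e9                  # leftover budget => constraint unmet
    pi = np.zeros((H, S, n_grid), dtype=int)

    for h in range(H - 1, -1, -1):
        for s, i in itertools.product(range(S), range(n_grid)):
            best, best_a = -np.inf, 0
            for a in range(A):
                # the budget shrinks deterministically by the utility collected
                j = np.abs(grid - max(grid[i] - u[s, a], 0.0)).argmin()
                # robust backup: worst case over the uncertainty set
                q = min(P[s, a] @ V[h + 1][:, j] for P in kernels)
                if r[s, a] + q > best:
                    best, best_a = r[s, a] + q, a
            V[h, s, i], pi[h, s, i] = best, best_a
    return V, pi
```

The point of the augmentation is visible in the backup: the optimal action at $(s, b)$ can depend on how much utility is still owed, which is exactly the history dependence that makes plain Markovian policies suboptimal in the constrained robust setting.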
Related papers
- Provably Efficient Algorithms for S- and Non-Rectangular Robust MDPs with General Parameterization [85.91302339486673]
We study robust Markov decision processes (RMDPs) with general policy parameterization under s-rectangular and non-rectangular uncertainty sets. We prove novel Lipschitz and Lipschitz-smoothness properties for general policy parameterizations that extend to infinite state spaces. We design a projected gradient descent algorithm for s-rectangular uncertainty and a Frank-Wolfe algorithm for non-rectangular uncertainty.
arXiv Detail & Related papers (2026-02-11T21:44:20Z)
- Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum [62.691095807959215]
We establish an optimal sample complexity of $O(\epsilon^{-2})$ for obtaining an $\epsilon$-optimal global policy using a single-timescale actor-critic (AC) algorithm. These mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.
arXiv Detail & Related papers (2026-02-02T00:35:42Z)
- Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees [16.01190705000295]
Constrained decision-making is essential for designing safe policies in real-world control systems. We propose a novel technique that effectively minimizes the constraint value function to satisfy the constraints. We prove that such an algorithm finds a policy with at most $\epsilon$ sub-optimality that is feasible after $O(\epsilon^{-2})$ iterations.
arXiv Detail & Related papers (2025-05-25T17:27:06Z)
- Provably Sample-Efficient Robust Reinforcement Learning with Average Reward [4.530028899565083]
We propose a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $\ell_p$-norm and contamination models. Our algorithm operates without requiring any prior knowledge of the robust MDP. Our work provides essential theoretical understanding of the sample efficiency of robust average-reward RL.
arXiv Detail & Related papers (2025-05-18T15:34:45Z)
- Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning [10.708457894356563]
We propose two algorithms that achieve near-optimal sample complexity. We prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}|\, t_{\mathrm{mix}}^2 \varepsilon^{-2}\right)$ for estimating the optimal policy. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning.
arXiv Detail & Related papers (2025-05-15T06:42:25Z)
- Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation [32.74649239695449]
We study the reinforcement learning problem in a constrained Markov decision process (CMDP). We propose an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
arXiv Detail & Related papers (2025-02-14T13:07:25Z)
- Sample-Efficient Constrained Reinforcement Learning with General Parameterization [35.22742439337603]
We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon.
We develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm that ensures an $\epsilon$ global optimality gap and $\epsilon$ constraint violation.
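The primal-dual mechanism behind guarantees of this kind fits in a few lines. The following is a generic Lagrangian primal-dual step for a CMDP, offered only as a sketch of the idea: `policy_grad` stands in for PD-ANPG's accelerated natural-gradient update, the value estimators are assumed given, and none of these names come from the paper.

```python
def primal_dual_step(theta, lam, value_est, policy_grad, b, alpha, eta):
    """One generic Lagrangian primal-dual update for a CMDP
    (a sketch of the mechanism, not PD-ANPG itself).

    value_est(theta)        -> (V_r, V_c): estimated reward and constraint
                               values of the current policy
    policy_grad(theta, lam) -> estimated gradient of V_r + lam * V_c
    b                       -> constraint threshold (we require V_c >= b)
    """
    _, V_c = value_est(theta)
    theta = theta + alpha * policy_grad(theta, lam)  # primal ascent on the Lagrangian
    lam = max(0.0, lam - eta * (V_c - b))            # raise lam while infeasible
    return theta, lam
```

When the constraint is violated ($V_c < b$), the dual step increases $\lambda$, which in turn tilts the primal objective toward constraint satisfaction; the optimality-gap and violation bounds come from balancing these two time scales.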
arXiv Detail & Related papers (2024-05-17T08:39:05Z)
- The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model [71.59406356321101]
This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP.
arXiv Detail & Related papers (2023-05-26T02:32:03Z)
- Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z)
- Near-Optimal Sample Complexity Bounds for Constrained MDPs [25.509556551558834]
We provide minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP.
We show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
arXiv Detail & Related papers (2022-06-13T15:58:14Z)
- Efficient Policy Iteration for Robust Markov Decision Processes via Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $s$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
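Schematically, and stated only from the general shape of such equivalence results rather than this paper's exact theorem: for an $s$-rectangular $L_p$ ball of radius $\beta_s$ around a nominal kernel $\bar{p}$, the robust policy evaluation backup looks like a nominal backup minus a norm penalty, $$(T^{\pi}_{\mathcal{U}} v)(s) = \sum_{a} \pi(a \mid s)\left(r(s,a) + \gamma\, \bar{p}_{s,a}^{\top} v\right) - \beta_s\, \Omega_q\big(\pi(\cdot \mid s), v\big),$$ where $\Omega_q$ is a regularizer built from the norm dual to $L_p$ and a variance-like functional of $v$; the concrete form of this term, and of the corresponding greedy policy, is what the paper derives.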
arXiv Detail & Related papers (2022-05-28T04:05:20Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)