Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model
- URL: http://arxiv.org/abs/2507.02089v1
- Date: Wed, 02 Jul 2025 19:07:37 GMT
- Title: Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model
- Authors: Xingtu Liu, Lin F. Yang, Sharan Vaswani
- Abstract summary: We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs). The objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. We propose a primal-dual framework that can leverage any black-box unconstrained MDP solver.
- Score: 16.578348944264505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with feature dimension $d$, we instantiate the framework by using mirror descent value iteration (\texttt{MDVI})~\citep{kitamura2023regularization} as an example MDP solver. We provide sample complexity bounds for the resulting CMDP algorithm in two cases: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to exactly satisfy the constraint. For (i), we prove that the algorithm can return an $\epsilon$-optimal policy with high probability by using $\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\epsilon^2}\right)$ samples. We note that these results exhibit a near-optimal dependence on both $d$ and $\epsilon$. For (ii), we show that the algorithm requires $\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\epsilon^2\zeta^2}\right)$ samples, where $\zeta$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Finally, we instantiate our framework for tabular CMDPs and show that it can be used to recover near-optimal sample complexities in this setting.
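The primal-dual idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a small tabular CMDP with a known transition tensor, and it swaps in exact value iteration for the black-box solver instead of MDVI with generative-model samples. The function names (`value_iteration`, `policy_value`, `primal_dual`), the step size `eta`, and the dual cap `lam_max` are all hypothetical choices made for illustration. Each round solves the unconstrained MDP with the Lagrangian reward $r + \lambda c$, then takes a projected gradient step on $\lambda$ for the constraint $V_c^\pi(\rho) \ge b$.

```python
import numpy as np

def value_iteration(P, reward, gamma, iters=500):
    """Stand-in black-box unconstrained solver (assumption: exact model P is known).
    P has shape (S, A, S); returns a greedy deterministic policy of shape (S,)."""
    S, A, _ = P.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        q = reward + gamma * (P @ q.max(axis=1))
    return q.argmax(axis=1)

def policy_value(P, reward, policy, gamma, rho):
    """Exact discounted value of a deterministic policy under start distribution rho."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]          # (S, S) transitions under the policy
    r_pi = reward[idx, policy]     # (S,) rewards under the policy
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return float(rho @ v)

def primal_dual(P, r, c, b, gamma, rho, T=200, eta=0.1, lam_max=10.0):
    """Hypothetical primal-dual loop for: max V_r subject to V_c >= b.
    Returns the reward and constraint values of the uniform mixture over
    the T iterates (the averages below), plus the final dual variable."""
    lam = 0.0
    reward_vals, cons_vals = [], []
    for _ in range(T):
        # Primal step: solve the unconstrained MDP with Lagrangian reward.
        pi = value_iteration(P, r + lam * c, gamma)
        v_r = policy_value(P, r, pi, gamma, rho)
        v_c = policy_value(P, c, pi, gamma, rho)
        reward_vals.append(v_r)
        cons_vals.append(v_c)
        # Dual step: projected gradient descent on lambda over [0, lam_max].
        lam = float(np.clip(lam - eta * (v_c - b), 0.0, lam_max))
    return float(np.mean(reward_vals)), float(np.mean(cons_vals)), lam

# Toy one-state CMDP: action 0 earns reward, action 1 earns constraint reward.
P = np.ones((1, 2, 1))
r = np.array([[1.0, 0.0]])
c = np.array([[0.0, 1.0]])
avg_r, avg_c, lam = primal_dual(P, r, c, b=5.0, gamma=0.9, rho=np.array([1.0]))
print(avg_r, avg_c, lam)
```

In this toy problem each individual iterate is deterministic and either ignores or over-satisfies the constraint; only the mixture over iterates comes close to the constrained optimum, and it may violate the constraint slightly. That corresponds loosely to the relaxed-feasibility case (i) in the abstract.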
Related papers
- Primal-Dual Sample Complexity Bounds for Constrained Markov Decision Processes with Multiple Constraints [0.0]
This paper addresses the challenge of solving Constrained Markov Decision Processes (CMDPs) with $d > 1$ constraints when the transition dynamics are unknown. We propose a model-based algorithm for infinite horizon CMDPs with multiple constraints, aiming to derive and prove sample complexity bounds for learning near-optimal policies.
arXiv Detail & Related papers (2025-03-09T20:10:35Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2023-11-22T15:34:44Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes in the fixed confidence setting under a generative model.
The lower bound, expressed as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Near-Optimal Sample Complexity Bounds for Constrained MDPs [25.509556551558834]
We provide minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP.
We show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
arXiv Detail & Related papers (2022-06-13T15:58:14Z) - Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z) - Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Towards Tight Bounds on the Sample Complexity of Average-reward MDPs [39.01663172393174]
We find an optimal policy of an infinite-horizon average-reward Markov decision process given access to a generative model.
We provide an algorithm that solves the problem using $\widetilde{O}(t_{\mathrm{mix}} \epsilon^{-3})$ (oblivious) samples per state-action pair.
arXiv Detail & Related papers (2021-06-13T17:18:11Z) - Efficiently Solving MDPs with Stochastic Mirror Descent [38.30919646721354]
We present a unified framework for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model.
We achieve these results through a more general mirror descent framework for solving bilinear saddle-point problems with simplex and box domains.
arXiv Detail & Related papers (2020-08-28T17:58:40Z)