Near-Optimal Sample Complexity Bounds for Constrained MDPs
- URL: http://arxiv.org/abs/2206.06270v1
- Date: Mon, 13 Jun 2022 15:58:14 GMT
- Title: Near-Optimal Sample Complexity Bounds for Constrained MDPs
- Authors: Sharan Vaswani, Lin F. Yang, Csaba Szepesvári
- Abstract summary: We provide minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP.
We show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
- Score: 25.509556551558834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In contrast to the advances in characterizing the sample complexity for
solving Markov decision processes (MDPs), the optimal statistical complexity
for solving constrained MDPs (CMDPs) remains unknown. We resolve this question
by providing minimax upper and lower bounds on the sample complexity for
learning near-optimal policies in a discounted CMDP with access to a generative
model (simulator). In particular, we design a model-based algorithm that
addresses two settings: (i) relaxed feasibility, where small constraint
violations are allowed, and (ii) strict feasibility, where the output policy is
required to satisfy the constraint. For (i), we prove that our algorithm
returns an $\epsilon$-optimal policy with probability $1 - \delta$, by making
$\tilde{O}\left(\frac{S A \log(1/\delta)}{(1 - \gamma)^3 \epsilon^2}\right)$
queries to the generative model, thus matching the sample-complexity for
unconstrained MDPs. For (ii), we show that the algorithm's sample complexity is
upper-bounded by $\tilde{O} \left(\frac{S A \, \log(1/\delta)}{(1 - \gamma)^5
\, \epsilon^2 \zeta^2} \right)$ where $\zeta$ is the problem-dependent Slater
constant that characterizes the size of the feasible region. Finally, we prove
a matching lower-bound for the strict feasibility setting, thus obtaining the
first near minimax optimal bounds for discounted CMDPs. Our results show that
learning CMDPs is as easy as MDPs when small constraint violations are allowed,
but inherently more difficult when we demand zero constraint violation.
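The model-based recipe described in the abstract (query the generative model, fit an empirical transition model, then plan in the empirical CMDP) can be made concrete with a short sketch. The snippet below is an illustrative reconstruction under standard assumptions, not the authors' exact algorithm: it estimates $\hat{P}$ from simulator queries and solves the empirical CMDP as a standard occupancy-measure linear program, with a `slack` argument standing in for the relaxed-feasibility tolerance (set it to zero for the strict setting). The names `simulator`, `n_samples`, and `slack` are hypothetical and introduced only for this example.

```python
# A minimal illustrative sketch (not the authors' exact algorithm): query the
# generative model to build an empirical CMDP, then solve it as a standard
# occupancy-measure linear program.  `simulator`, `n_samples`, and `slack`
# are assumed names introduced only for this example.
import numpy as np
from scipy.optimize import linprog

def empirical_model(simulator, S, A, n_samples):
    """Estimate P_hat(s'|s,a) with n_samples generative-model queries per (s,a)."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_samples):
                P_hat[s, a, simulator(s, a)] += 1.0  # simulator returns next state s'
    return P_hat / n_samples

def solve_cmdp_lp(P, r, u, b, rho, gamma, slack=0.0):
    """Maximize discounted reward subject to discounted utility >= b - slack.

    Variables are discounted occupancy measures q(s,a); the policy is recovered
    by normalizing q over actions.  slack > 0 mimics the relaxed-feasibility
    setting, slack = 0 the strict one (assuming the empirical CMDP is feasible).
    """
    S, A, _ = P.shape
    # Flow conservation: sum_a q(s',a) - gamma * sum_{s,a} P(s'|s,a) q(s,a) = rho(s')
    A_eq = np.zeros((S, S * A))
    for s_next in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s_next, s * A + a] = float(s == s_next) - gamma * P[s, a, s_next]
    res = linprog(
        c=-r.reshape(-1),                             # linprog minimizes, so negate r.q
        A_ub=-u.reshape(1, -1), b_ub=[-(b - slack)],  # utility constraint u.q >= b - slack
        A_eq=A_eq, b_eq=rho,
        bounds=(0, None), method="highs",
    )
    q = res.x.reshape(S, A)
    return q / np.maximum(q.sum(axis=1, keepdims=True), 1e-12)  # pi(a|s)
```

Under the relaxed-feasibility bound quoted above, roughly $\tilde{O}\left(\frac{\log(1/\delta)}{(1-\gamma)^3 \epsilon^2}\right)$ queries per state-action pair (the `n_samples` argument) suffice, which multiplied over the $SA$ pairs gives the stated $\tilde{O}\left(\frac{S A \log(1/\delta)}{(1-\gamma)^3 \epsilon^2}\right)$ total.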
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Towards Instance-Optimality in Online PAC Reinforcement Learning [28.156332484814616]
We propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy.
We demonstrate that the sample complexity of the PEDEL algorithm of Wagenmaker et al. (2022) closely approaches this lower bound.
arXiv Detail & Related papers (2023-10-31T19:26:36Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed-confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP [12.37249250512371]
Constrained Markov Decision Process (CMDP) is an important framework for safe Reinforcement Learning.
In this paper, we focus on solving the CMDP problems where only offline data are available.
By adopting the concept of the single-policy concentrability coefficient $C^*$, we establish an $\Omega\big(\min\{|\mathcal{S}||\mathcal{A}|,\,|\mathcal{S}|+I\}\,C^*\cdots\big)$ lower bound.
arXiv Detail & Related papers (2022-07-13T12:13:38Z) - Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that the separation between reward-free and reward-aware RL observed in other settings does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower-bound for regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower-bound is considerably smaller than in the general case and it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper-bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z) - Towards Tight Bounds on the Sample Complexity of Average-reward MDPs [39.01663172393174]
We study finding an optimal policy of an infinite-horizon average-reward Markov decision process given access to a generative model.
We provide an algorithm that solves the problem using $\widetilde{O}(t_{\mathrm{mix}}\,\epsilon^{-3})$ (oblivious) samples per state-action pair.
arXiv Detail & Related papers (2021-06-13T17:18:11Z) - Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level (see the sketch after this list).
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
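The last related paper above refers to plain model-based planning with a generative model. A minimal sketch of that generic idea, assuming the hypothetical `empirical_model` helper from the earlier snippet, is value iteration on the empirical MDP; this illustrates the technique, not that paper's exact procedure.

```python
# A minimal illustrative sketch (not the paper's exact procedure) of "plain
# model-based planning": run value iteration on the empirical MDP built from
# generative-model samples (e.g. via the `empirical_model` helper sketched above).
import numpy as np

def plan_in_empirical_mdp(P_hat, r, gamma, n_iters=1000):
    """Value iteration on the empirical MDP; returns a greedy deterministic policy."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = r + gamma * (P_hat @ V)   # Q[s, a] = r[s, a] + gamma * sum_s' P_hat[s, a, s'] * V[s']
        V = Q.max(axis=1)             # Bellman optimality backup
    return Q.argmax(axis=1)           # greedy policy: pi(s) = argmax_a Q[s, a]
```

For example, `plan_in_empirical_mdp(empirical_model(simulator, S, A, N), r, gamma=0.99)` returns a greedy policy for the empirical model built from `N` queries per state-action pair.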