Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation
- URL: http://arxiv.org/abs/2502.10138v2
- Date: Tue, 18 Feb 2025 02:30:30 GMT
- Title: Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation
- Authors: Toshinori Kitamura, Arnob Ghosh, Tadashi Kozuno, Wataru Kumagai, Kazumi Kasaura, Kenta Hoshino, Yohei Hosoe, Yutaka Matsuo,
- Abstract summary: We study the reinforcement learning problem in a constrained decision process (CMDP)
We propose an RL algorithm for linear CMDPs that achieves $tildemathcalO(sqrtK)$ regret with an episode-wise zero-violation guarantee.
Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
- Score: 24.299769025346368
- License:
- Abstract: We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
Related papers
- Sample and Oracle Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions [10.225358400539719]
We present an efficient reinforcement algorithm for Decision (MDPs) where a linear-action-action generalizes in a feature map.
Specifically, we introduce a new algorithm that efficiently finds a near-optimal policy in this setting.
arXiv Detail & Related papers (2024-09-07T14:38:05Z) - A safe exploration approach to constrained Markov decision processes [7.036452261968767]
We consider discounted infinite horizon constrained Markov decision processes (CMDPs)
The goal is to find an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints.
Motivated by the application of CMDPs in online learning of safety-critical systems, we focus on developing a model-free and simulator-free algorithm.
arXiv Detail & Related papers (2023-12-01T13:16:39Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL)
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a sample complexity of $epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of action space, and $d$ is the dimension of representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Anytime-Constrained Reinforcement Learning [6.981971551979697]
We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints.
We show that there exist optimal deterministic policies augmented with cumulative costs.
We show that computing non-trivial approximately optimal policies is NP-hard in general.
arXiv Detail & Related papers (2023-11-09T16:51:26Z) - Provably Efficient Model-Free Algorithms for Non-stationary CMDPs [10.930095238723327]
We study model-free reinforcement learning algorithms in episodic non-stationary constrained Markov Decision Processes.
In the non-stationary environment, reward, utility functions, and transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets.
We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs.
arXiv Detail & Related papers (2023-03-10T06:33:38Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $tilde O(dsqrtH3K)$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Settling the Horizon-Dependence of Sample Complexity in Reinforcement
Learning [82.31436758872715]
We develop an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions.
We establish a connection between value functions in discounted and finite-horizon Markov decision processes.
arXiv Detail & Related papers (2021-11-01T00:21:24Z) - Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs)
Some variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
arXiv Detail & Related papers (2021-02-27T19:28:39Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present an provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.