Provably Efficient Safe Exploration via Primal-Dual Policy Optimization
- URL: http://arxiv.org/abs/2003.00534v2
- Date: Mon, 26 Oct 2020 02:35:18 GMT
- Title: Provably Efficient Safe Exploration via Primal-Dual Policy Optimization
- Authors: Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, Mihailo R. Jovanović
- Abstract summary: We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
- Score: 105.7510838453122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the Safe Reinforcement Learning (SRL) problem using the Constrained
Markov Decision Process (CMDP) formulation in which an agent aims to maximize
the expected total reward subject to a safety constraint on the expected total
value of a utility function. We focus on an episodic setting with function
approximation where the Markov transition kernels have a linear structure, but
we do not impose any additional assumptions on the sampling model. Designing SRL
algorithms with provable computational and statistical efficiency is
particularly challenging under this setting because of the need to incorporate
both the safety constraint and the function approximation into the fundamental
exploitation/exploration tradeoff. To this end, we present an
\underline{O}ptimistic \underline{P}rimal-\underline{D}ual Proximal Policy
\underline{OP}timization (OPDOP) algorithm where the value function is
estimated by combining the least-squares policy evaluation and an additional
bonus term for safe exploration. We prove that the proposed algorithm achieves
an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$
constraint violation, where $d$ is the dimension of the feature mapping, $H$ is
the horizon of each episode, and $T$ is the total number of steps. These bounds
hold when the reward/utility functions are fixed but the feedback after each
episode is bandit. Our bounds depend on the capacity of the state-action space
only through the dimension of the feature mapping and thus our results hold
even when the number of states goes to infinity. To the best of our knowledge,
we provide the first provably efficient online policy optimization algorithm
for CMDP with safe exploration in the function approximation setting.
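
Below is a minimal sketch of the two ingredients named in the abstract: least-squares policy evaluation over a linear feature map $\phi(s,a)$ with an added exploration bonus, and a projected update of the Lagrange multiplier attached to the utility (safety) constraint. The synthetic features, bonus scale, and step size are placeholder assumptions, so this illustrates the idea rather than the authors' OPDOP implementation.

```python
# Illustrative sketch only: synthetic features, a single evaluation step, and
# hand-picked constants stand in for the quantities the paper derives.
import numpy as np


def optimistic_lsq_evaluation(phi, targets, lam=1.0, beta=1.0):
    """Ridge-regression value estimate plus an elliptical exploration bonus.

    phi:     (n, d) features phi(s_h, a_h) of observed state-action pairs
    targets: (n,) regression targets, e.g. reward (or utility) plus next-state value
    Returns a function mapping an (m, d) feature matrix to optimistic Q estimates.
    """
    n, d = phi.shape
    Lambda = lam * np.eye(d) + phi.T @ phi           # regularized empirical covariance
    Lambda_inv = np.linalg.inv(Lambda)
    w = Lambda_inv @ phi.T @ targets                 # least-squares weights

    def q_hat(features):
        # Bonus sqrt(phi^T Lambda^{-1} phi) is large for poorly explored directions.
        bonus = beta * np.sqrt(np.einsum("id,de,ie->i", features, Lambda_inv, features))
        return features @ w + bonus

    return q_hat


def dual_step(dual_var, estimated_utility_value, threshold, step_size=0.1):
    """One projected gradient step on the Lagrange multiplier of the safety constraint.

    The multiplier increases when the estimated utility value is below the required
    threshold (the constraint looks violated) and decays toward zero otherwise.
    """
    return max(0.0, dual_var + step_size * (threshold - estimated_utility_value))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 4, 256
    phi = rng.normal(size=(n, d))
    reward_targets = phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
    utility_targets = phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    q_reward = optimistic_lsq_evaluation(phi, reward_targets)
    q_utility = optimistic_lsq_evaluation(phi, utility_targets)

    # A policy update would act on the combined signal q_reward + dual * q_utility;
    # here we only show the dual step driven by the estimated utility value.
    dual = dual_step(dual_var=0.0,
                     estimated_utility_value=float(q_utility(phi).mean()),
                     threshold=0.5)
    print("dual variable after one update:", dual)
```

In a primal-dual treatment of the CMDP, an estimator of this kind would typically be maintained for both the reward and the utility signals, with the dual variable trading them off during the policy update.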
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_π$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z)
- Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z)
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards (see the sketch after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm [42.83837408373223]
We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces.
We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) to achieve zero constraint violation.
arXiv Detail & Related papers (2022-06-12T22:31:43Z)
- Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z)
- Safe Exploration for Constrained Reinforcement Learning with Provable Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^2\sqrt{A H^7 K}/(\bar{C} - \bar{C}_b))$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- Safe Learning under Uncertain Objectives and Constraints [66.05180398174286]
We consider non-convex optimization problems under unknown yet safety-critical constraints.
Such problems naturally arise in a variety of domains including robotics, manufacturing, and medical procedures.
A crucial component of our analysis is to introduce and apply a technique called shrinkage in the context of safe optimization.
arXiv Detail & Related papers (2020-06-23T20:51:00Z)
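
As referenced in the PARTED entry above, the following is a small schematic sketch of least-squares reward redistribution: fit per-step proxy rewards whose sum along each trajectory matches the observed trajectory-wise return. The linear reward model and synthetic data are assumptions for illustration; this is not the PARTED implementation.

```python
# Schematic sketch of least-squares reward redistribution from trajectory returns.
import numpy as np


def redistribute_rewards(trajectory_features, returns, lam=1.0):
    """Fit theta so that sum_h phi(s_h, a_h)^T theta matches each trajectory's return.

    trajectory_features: list of (H, d) arrays of per-step features
    returns:             (num_traj,) observed trajectory-wise returns
    Returns theta; phi(s, a)^T theta then serves as a per-step proxy reward.
    """
    # Summing features over a trajectory turns the per-step linear model into a
    # linear model of the whole-trajectory return.
    X = np.stack([traj.sum(axis=0) for traj in trajectory_features])  # (num_traj, d)
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ np.asarray(returns))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    H, d, num_traj = 5, 3, 200
    true_theta = rng.normal(size=d)
    trajs = [rng.normal(size=(H, d)) for _ in range(num_traj)]
    returns = np.array([(traj @ true_theta).sum() for traj in trajs])

    theta = redistribute_rewards(trajs, returns)
    proxy = trajs[0] @ theta  # per-step proxy rewards for the first trajectory
    print("recovered theta:", np.round(theta, 3))
    print("proxy rewards (trajectory 0):", np.round(proxy, 3))
```

A pessimistic offline method would then plan with these proxy rewards while subtracting, rather than adding, an uncertainty penalty.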