Constrained Policy Gradient Method for Safe and Fast Reinforcement
Learning: a Neural Tangent Kernel Based Approach
- URL: http://arxiv.org/abs/2107.09139v1
- Date: Mon, 19 Jul 2021 20:25:15 GMT
- Title: Constrained Policy Gradient Method for Safe and Fast Reinforcement
Learning: a Neural Tangent Kernel Based Approach
- Authors: Balázs Varga, Balázs Kulcsár, Morteza Haghir Chehreghani
- Abstract summary: This paper presents a constrained policy gradient algorithm for safe learning.
We introduce constraints for safe learning with the following steps.
The efficiency of the constrained learning was demonstrated with a shallow and wide ReLU network in the Cartpole and Lunar Lander OpenAI gym environments.
- Score: 6.316693022958221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a constrained policy gradient algorithm. We introduce
constraints for safe learning with the following steps. First, learning is
slowed down (lazy learning) so that the episodic policy change can be computed
with the help of the policy gradient theorem and the neural tangent kernel.
This, in turn, enables evaluation of the policy at arbitrary states. In the
same spirit, learning can be guided, and safety ensured, by augmenting episode
batches with states at which the desired action probabilities are prescribed.
Finally, exogenous discounted sums of future rewards (returns) can be computed
at these specific state-action pairs such that the policy network satisfies the
constraints. Computing the returns amounts to solving a system of linear
equations (equality constraints) or a constrained quadratic program (inequality
constraints). Simulation results suggest that adding constraints (external
information) to learning can considerably improve both its speed and safety
when the constraints are appropriately selected. The efficiency of the
constrained learning was demonstrated with a shallow and wide ReLU network in
the Cartpole and Lunar Lander OpenAI gym environments. The main novelty of the
paper is giving a practical use of the neural tangent kernel in reinforcement
learning.
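To make the first step concrete: in the lazy (slow) learning regime, one
policy-gradient step changes the network's output approximately linearly in the
empirical neural tangent kernel, so the episodic policy change can be predicted
at arbitrary query states. The sketch below illustrates this idea only; all
names, shapes, and the learning rate `lr` are illustrative assumptions, not the
authors' code.

```python
import numpy as np

def empirical_ntk(jac_a, jac_b):
    """Empirical neural tangent kernel between two batches of states:
    Theta[i, j] = <dpi(s_i)/dtheta, dpi(s_j)/dtheta>, computed from
    per-state Jacobians of the policy output w.r.t. the parameters."""
    return jac_a @ jac_b.T  # shape (n_a, n_b)

def predicted_policy_change(jac_query, jac_batch, returns, lr):
    """First-order (lazy-learning) prediction of how one policy-gradient
    step on an episode batch, weighted by its returns G_i, changes the
    policy output at arbitrary query states:
        delta_pi(s') ~= lr * sum_i Theta(s', s_i) * G_i
    """
    theta = empirical_ntk(jac_query, jac_batch)  # (n_query, n_batch)
    return lr * theta @ returns                  # (n_query,)

# Toy usage with random Jacobians (5 hypothetical network parameters):
rng = np.random.default_rng(0)
jac_batch = rng.normal(size=(8, 5))   # 8 visited states in the episode
jac_query = rng.normal(size=(3, 5))   # 3 arbitrary states to evaluate
returns = rng.normal(size=8)          # episodic returns G_i
print(predicted_policy_change(jac_query, jac_batch, returns, lr=1e-3))
```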
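And a sketch of the final step: choosing exogenous returns at the constrained
states so that the predicted policy change there matches (equality) or at least
meets (inequality) a desired change. The equality case reduces to a linear
solve; the inequality case to a small quadratic program, here handled with
SciPy's SLSQP as a stand-in for whichever QP solver the paper actually uses.
Again, names and shapes are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def returns_equality(jac_con, target_change, lr):
    """Equality constraints: solve  lr * Theta G = target_change  for the
    exogenous returns G at the constrained states, where Theta is the
    empirical NTK restricted to those states."""
    theta = jac_con @ jac_con.T
    return np.linalg.solve(lr * theta, target_change)

def returns_inequality(jac_con, min_change, lr):
    """Inequality constraints: a small constrained QP. Find the
    smallest-norm returns whose predicted policy change meets the bound
        lr * Theta G >= min_change  (elementwise)."""
    theta = jac_con @ jac_con.T
    n = theta.shape[0]
    res = minimize(
        lambda g: g @ g,                      # minimize ||G||^2
        np.zeros(n),
        method="SLSQP",
        constraints={"type": "ineq",
                     "fun": lambda g: lr * theta @ g - min_change},
    )
    return res.x

# Toy usage: 2 constrained states, 5 hypothetical network parameters.
rng = np.random.default_rng(1)
jac_con = rng.normal(size=(2, 5))
print(returns_equality(jac_con, np.array([0.1, -0.1]), lr=1e-3))
print(returns_inequality(jac_con, np.array([0.1, -0.1]), lr=1e-3))
```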
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time [0.6554326244334868]
We present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint.
We show that the learned policy is safe with high confidence.
We also demonstrate that efficient exploration can be achieved by defining a subset of the state space called the proxy set.
arXiv Detail & Related papers (2024-03-23T20:22:30Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z) - Model-based Safe Deep Reinforcement Learning via a Constrained Proximal
Policy Optimization Algorithm [4.128216503196621]
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner.
We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z) - Policy Gradients for Probabilistic Constrained Reinforcement Learning [13.441235221641717]
This paper considers the problem of learning safe policies in the context of reinforcement learning (RL).
We aim to design policies that maintain the state of the system in a safe set with high probability.
arXiv Detail & Related papers (2022-10-02T18:16:33Z) - Learn Zero-Constraint-Violation Policy in Model-Free Constrained
Reinforcement Learning [7.138691584246846]
We propose the safe set actor-critic (SSAC) algorithm, which confines the policy update using safety-oriented energy functions.
The safety index is designed to increase rapidly for potentially dangerous actions.
We claim that we can learn the energy function in a model-free manner similar to learning a value function.
arXiv Detail & Related papers (2021-11-25T07:24:30Z) - A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes whose complexity is independent of the number of states.
We give an efficient algorithm that is capable of improving the accuracy of such weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased gradients of the value function, which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.