Feasible Policy Iteration for Safe Reinforcement Learning
- URL: http://arxiv.org/abs/2304.08845v3
- Date: Thu, 13 Mar 2025 09:53:02 GMT
- Title: Feasible Policy Iteration for Safe Reinforcement Learning
- Authors: Yujie Yang, Zhilong Zheng, Shengbo Eben Li, Wei Xu, Jingjing Liu, Xianyuan Zhan, Ya-Qin Zhang
- Abstract summary: Safety is the priority concern when applying reinforcement learning (RL) algorithms to real-world control problems. We propose feasible policy iteration (FPI), the first foundational dynamic programming algorithm for safe RL. Experimental results demonstrate that FPI achieves strictly zero constraint violation on low-dimensional tasks.
- Score: 29.662547846929847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safety is the primary concern when applying reinforcement learning (RL) algorithms to real-world control problems. While policy iteration provides a fundamental algorithm for standard RL, an analogous theoretical algorithm for safe RL remains absent. In this paper, we propose feasible policy iteration (FPI), the first foundational dynamic programming algorithm for safe RL. FPI alternates between policy evaluation, region identification, and policy improvement. It follows the actor-critic-scenery (ACS) framework, where the scenery refers to a feasibility function that represents the feasible region. A region-wise update rule is developed for the policy improvement step: it maximizes the state-value function inside the feasible region and minimizes the feasibility function outside it. With this update rule, FPI guarantees monotonic expansion of the feasible region, monotonic improvement of the state-value function, and geometric convergence to the optimal safe policy. Experimental results demonstrate that FPI achieves strictly zero constraint violation on low-dimensional tasks and outperforms existing methods in constraint adherence and reward performance on high-dimensional tasks.
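To make the alternation concrete, below is a minimal tabular sketch of the three steps. It is illustrative only: the max-based feasibility backup, the rule restricting the inner maximization to actions with a non-positive feasibility backup, and all names are assumptions for exposition, not the paper's actual definitions or code.

```python
import numpy as np

# Minimal tabular sketch of the FPI loop from the abstract. Assumed inputs
# (illustrative, not from the paper's code): transition probabilities
# P[s, a, s'], rewards r[s, a], per-state constraint violation h[s] >= 0
# (h[s] == 0 means state s is safe), and discount factor gamma.

def evaluate_value(P, r, pi, gamma, iters=500):
    """Policy evaluation: V(s) = E_pi[ r(s, a) + gamma * V(s') ]."""
    r_pi = (pi * r).sum(axis=1)                # expected one-step reward under pi
    P_pi = np.einsum('sa,sat->st', pi, P)      # state-to-state transitions under pi
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = r_pi + gamma * (P_pi @ V)
    return V

def evaluate_feasibility(P, h, pi, gamma, iters=500):
    """Feasibility ("scenery") function. One common construction is a
    worst-case violation backup, F(s) = max(h(s), gamma * E_pi[F(s')]),
    so F(s) == 0 exactly when no future violation occurs under pi."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    F = h.copy()
    for _ in range(iters):
        F = np.maximum(h, gamma * (P_pi @ F))
    return F

def feasible_policy_iteration(P, r, h, gamma, n_iters=50):
    n_s, n_a = r.shape
    pi = np.full((n_s, n_a), 1.0 / n_a)        # start from the uniform policy
    for _ in range(n_iters):
        V = evaluate_value(P, r, pi, gamma)              # 1) policy evaluation
        F = evaluate_feasibility(P, h, pi, gamma)
        feasible = F <= 0.0                              # 2) region identification
        Q_v = r + gamma * np.einsum('sat,t->sa', P, V)   # per-action value backup
        Q_f = np.maximum(h[:, None], gamma * np.einsum('sat,t->sa', P, F))
        pi_new = np.zeros_like(pi)
        for s in range(n_s):
            if feasible[s]:
                # 3a) inside the feasible region: maximize value among actions
                # whose feasibility backup stays non-positive (an assumed rule)
                safe = np.flatnonzero(Q_f[s] <= 0.0)
                a = safe[np.argmax(Q_v[s, safe])] if safe.size else int(np.argmin(Q_f[s]))
            else:
                # 3b) outside the feasible region: minimize the feasibility function
                a = int(np.argmin(Q_f[s]))
            pi_new[s, a] = 1.0
        pi = pi_new
    return pi, V, F
```

Under the abstract's guarantees, the identified feasible set {s : F(s) = 0} should expand monotonically across iterations while V improves inside it.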
Related papers
- SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL [54.022106606140774]
We present theoretical results that provide a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setup.
We also present SPoRt, which enables the user to trade off safety guarantees in exchange for task-specific performance.
arXiv Detail & Related papers (2025-04-08T19:09:07Z)
Embedding Safety into RL: A New Take on Trust Region Methods [1.5733417396701983]
Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to unsafe behaviors.
We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints.
arXiv Detail & Related papers (2024-11-05T09:55:50Z)
Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued, and interpolating between discontinuous local optima inevitably leads to constraint violations.
We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
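For intuition on how a mixture output avoids interpolating between disjoint optima, here is a minimal Gaussian-mixture policy head in PyTorch; the architecture, sizes, and names are assumptions for illustration, not MUPO's actual implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, Normal, MixtureSameFamily

# Illustrative Gaussian-mixture policy head (assumed design, not MUPO's code).
# A mixture output can place probability mass on several disjoint local optima
# instead of interpolating between them.
class GaussianMixturePolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, n_components=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.logits = nn.Linear(hidden, n_components)             # mixing weights
        self.mean = nn.Linear(hidden, n_components * act_dim)     # component means
        self.log_std = nn.Linear(hidden, n_components * act_dim)  # component log-stds
        self.n, self.act_dim = n_components, act_dim

    def forward(self, obs):
        h = self.body(obs)
        mix = Categorical(logits=self.logits(h))
        mean = self.mean(h).view(-1, self.n, self.act_dim)
        std = self.log_std(h).view(-1, self.n, self.act_dim).clamp(-5, 2).exp()
        comp = Independent(Normal(mean, std), 1)   # each component: diagonal Gaussian
        return MixtureSameFamily(mix, comp)

# usage: dist = policy(obs); a = dist.sample(); logp = dist.log_prob(a)
```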
arXiv Detail & Related papers (2024-03-19T15:54:38Z)
Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
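Concretely, "constrained within the support of the behavior policy" is usually formalized as a policy class of the following generic form; the threshold epsilon and the notation are assumptions, not necessarily STR's exact definition.

```latex
% Support-constrained policy class (generic form; epsilon is an assumed threshold)
\Pi_{\beta} = \{\, \pi \;:\; \pi(a \mid s) > 0 \;\Rightarrow\; \beta(a \mid s) > \epsilon \,\}
```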
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
Provably Convergent Policy Optimization via Metric-aware Trust Region Methods [21.950484108431944]
Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions.
We show that WPO guarantees a monotonic performance improvement, and SPO provably converges to WPO as the entropic regularizer diminishes.
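Schematically, such methods replace the usual KL trust region with a Wasserstein (or Sinkhorn) ball; the template below is illustrative notation, not the paper's exact statement.

```latex
% Schematic Wasserstein trust-region update (illustrative notation)
\pi_{t+1} = \arg\max_{\pi}\; \mathbb{E}_{s \sim d^{\pi_t},\, a \sim \pi}\big[ A^{\pi_t}(s, a) \big]
\quad \text{s.t.} \quad \mathcal{W}\big(\pi(\cdot \mid s),\, \pi_t(\cdot \mid s)\big) \le \delta
```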
arXiv Detail & Related papers (2023-06-25T05:41:38Z)
Hallucinated Adversarial Control for Conservative Offline Policy Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where, given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z)
Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning [11.82492300303637]
We propose a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning.
We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
arXiv Detail & Related papers (2021-07-13T01:03:10Z)
Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
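The standard template behind policy mirror descent is the per-state proximal step below, with convex regularizer h, its Bregman divergence D_h, regularization weight tau, and step size eta; GPMD's exact operator may differ in its details.

```latex
% Generic regularized policy mirror descent step (standard template)
\pi_{t+1}(\cdot \mid s) = \arg\max_{p \,\in\, \Delta(\mathcal{A})}
\Big\{ \big\langle Q^{\pi_t}(s, \cdot),\, p \big\rangle
\;-\; \tau\, h(p) \;-\; \tfrac{1}{\eta}\, D_h\big(p,\; \pi_t(\cdot \mid s)\big) \Big\}
```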
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with global optimality guarantees.
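The SRL setting referenced here is typically formalized as a constrained MDP; this is the standard formulation, not something specific to CRPO.

```latex
% Standard constrained-MDP objective for safe RL
\max_{\pi}\; \mathbb{E}_{\pi}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t) \Big] \le d_i,
\qquad i = 1, \dots, m
```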
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning [14.325835899564664]
An entropy-regularized value-based reinforcement learning method can ensure the monotonic improvement of policies at each policy update.
We propose a novel reinforcement learning algorithm that exploits this lower bound as a criterion for adjusting the degree of a policy update, alleviating policy oscillation.
arXiv Detail & Related papers (2020-08-25T04:09:18Z)
Chance Constrained Policy Optimization for Process Control and Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
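A joint chance constraint couples all constraint functions and time steps into a single probability requirement; schematically, with assumed notation:

```latex
% Schematic joint chance constraint with confidence level 1 - alpha
\Pr\big( g_j(x_t) \le 0 \;\; \forall j,\, \forall t \big) \;\ge\; 1 - \alpha
```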
arXiv Detail & Related papers (2020-07-30T14:20:35Z)
Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
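For softmax policies, entropy-regularized NPG is often written as a multiplicative update of the form below, with entropy weight tau and learning rate eta; the exponents shown are a commonly cited form and should be checked against the paper.

```latex
% Multiplicative form of entropy-regularized NPG under softmax parameterization
% (commonly cited form; verify the exact exponents against the paper)
\pi_{t+1}(a \mid s) \;\propto\;
\pi_t(a \mid s)^{\,1 - \frac{\eta\tau}{1-\gamma}}\,
\exp\!\Big( \tfrac{\eta}{1-\gamma}\, Q_{\tau}^{\pi_t}(s, a) \Big)
```

In the limit eta = (1-gamma)/tau, this collapses to soft policy iteration, pi_{t+1} proportional to exp(Q_tau/tau).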
arXiv Detail & Related papers (2020-07-13T17:58:41Z)
Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.