Feasible Policy Iteration
- URL: http://arxiv.org/abs/2304.08845v2
- Date: Sun, 28 Jan 2024 10:13:02 GMT
- Title: Feasible Policy Iteration
- Authors: Yujie Yang, Zhilong Zheng, Shengbo Eben Li, Jingliang Duan, Jingjing
Liu, Xianyuan Zhan, Ya-Qin Zhang
- Abstract summary: We propose an indirect safe RL framework called feasible policy iteration.
It guarantees that the feasible region monotonically expands and converges to the maximum one.
Experiments show that our algorithm learns strictly safe and near-optimal policies with accurate feasible regions.
- Score: 28.29623882912745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safe reinforcement learning (RL) aims to find the optimal policy and its
feasible region in a constrained optimal control problem (OCP). Ensuring
feasibility and optimality simultaneously has been a major challenge. Existing
methods either attempt to solve OCPs directly with constrained optimization
algorithms, leading to unstable training processes and unsatisfactory
feasibility, or restrict policies in overly small feasible regions, resulting
in excessive conservativeness with sacrificed optimality. To address this
challenge, we propose an indirect safe RL framework called feasible policy
iteration, which guarantees that the feasible region monotonically expands and
converges to the maximum one, and the state-value function monotonically
improves and converges to the optimal one. We achieve this by designing a
policy update principle called region-wise policy improvement, which maximizes
the state-value function under the constraint of the constraint decay function
(CDF) inside the feasible region and minimizes the CDF outside the feasible
region simultaneously. This update scheme ensures that the state-value function
monotonically increases state-wise in the feasible region and the CDF
monotonically decreases state-wise in the entire state space. We prove that the
CDF converges to the solution of the risky Bellman equation while the
state-value function converges to the solution of the feasible Bellman
equation. The former represents the maximum feasible region and the latter
manifests the optimal state-value function. Experiments show that our algorithm
learns strictly safe and near-optimal policies with accurate feasible regions
on classic control tasks. It also achieves fewer constraint violations with
performance better than (or comparable to) baselines on Safety Gym.
Related papers
- Convergence of Policy Mirror Descent Beyond Compatible Function Approximation [66.4260157478436]
We develop PMD theory for general policy classes, assuming only a strictly weaker variational dominance condition, and obtain convergence to the best-in-class policy.
Our main result leverages a novel notion induced by the local norm of the occupancy-gradient measure.
arXiv Detail & Related papers (2025-02-16T08:05:46Z) - Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued; interpolating between discontinuous local optima can inevitably lead to constraint violations.
We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
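A Gaussian-mixture policy head of the kind described above can be sketched as follows; the two-component choice, network sizes, and class name are illustrative assumptions rather than MUPO's actual architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

# A sketch of a Gaussian-mixture policy head (hypothetical sizes and names).
class MixturePolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, n_components=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.logits = nn.Linear(hidden, n_components)           # mixture weights
        self.mean = nn.Linear(hidden, n_components * act_dim)   # component means
        self.log_std = nn.Parameter(torch.zeros(n_components, act_dim))
        self.n, self.act_dim = n_components, act_dim

    def forward(self, obs):
        h = self.body(obs)
        mix = Categorical(logits=self.logits(h))
        mean = self.mean(h).view(-1, self.n, self.act_dim)
        comp = Independent(Normal(mean, self.log_std.exp()), 1)
        return MixtureSameFamily(mix, comp)

policy = MixturePolicy(obs_dim=4, act_dim=2)
dist = policy(torch.randn(8, 4))   # batch of 8 observations
actions = dist.sample()            # shape (8, 2); sampled modes can be far apart
```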
arXiv Detail & Related papers (2024-03-19T15:54:38Z) - Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR), which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
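As a rough illustration of the support constraint (not STR's actual trust-region update), one could zero out candidate-policy mass on actions the behavior policy essentially never takes and renormalize; the threshold below is an arbitrary assumption.

```python
import numpy as np

# A toy sketch of the "support constraint" idea for a discrete action space.
def project_to_support(pi, behavior, eps=1e-3):
    mask = behavior > eps                 # estimated support of the behavior policy
    projected = np.where(mask, pi, 0.0)   # drop unsupported actions
    return projected / projected.sum()    # renormalize over the support

behavior = np.array([0.5, 0.45, 0.05, 0.0])   # behavior-policy action probabilities
pi = np.array([0.1, 0.2, 0.3, 0.4])           # candidate policy at one state
print(project_to_support(pi, behavior))        # mass only where behavior has support
```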
arXiv Detail & Related papers (2023-11-15T13:16:16Z) - Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states and more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z) - Provably Convergent Policy Optimization via Metric-aware Trust Region
Methods [21.950484108431944]
Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions.
We show that WPO guarantees a monotonic performance improvement, and SPO provably converges to WPO as the entropic regularizer diminishes.
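For intuition, an entropy-regularized optimal-transport (Sinkhorn) discrepancy between two discrete action distributions can be computed with a few fixed-point iterations and used as a trust-region term; the cost matrix, regularizer, and iteration count below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

# A toy Sinkhorn discrepancy between two discrete action distributions.
def sinkhorn_distance(p, q, cost, eps=0.1, iters=200):
    K = np.exp(-cost / eps)                  # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(iters):                   # Sinkhorn fixed-point iterations
        v = q / (K.T @ u)
        u = p / (K @ v)
    transport = u[:, None] * K * v[None, :]  # entropic optimal transport plan
    return float(np.sum(transport * cost))   # transport cost under that plan

actions = np.linspace(-1.0, 1.0, 5)
cost = np.abs(actions[:, None] - actions[None, :])   # ground metric on actions
pi_old = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
pi_new = np.array([0.05, 0.15, 0.40, 0.25, 0.15])
print(sinkhorn_distance(pi_old, pi_new, cost))       # use as a trust-region penalty
```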
arXiv Detail & Related papers (2023-06-25T05:41:38Z) - Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z) - Trust Region Policy Optimization with Optimal Transport Discrepancies:
Duality and Algorithm for Continuous Actions [5.820284464296154]
Trust Region Policy Optimization is a popular approach to stabilize the policy updates.
We propose a novel algorithm - Optimal Transport Trust Region Policy Optimization (OT-TRPO) - for continuous state-action spaces.
Our results show that optimal transport discrepancies can offer an advantage over state-of-the-art approaches.
arXiv Detail & Related papers (2022-10-20T10:04:35Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first convergence analysis of SRL algorithms with global optimality guarantees.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Chance Constrained Policy Optimization for Process Control and
Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
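A joint chance constraint of this kind can be checked empirically from closed-loop rollouts. The sketch below, with made-up data, limits, and risk level alpha, only estimates the joint violation probability; it does not reproduce the paper's constraint-tightening mechanism.

```python
import numpy as np

# Empirical check of a joint chance constraint over Monte Carlo rollouts.
def joint_violation_probability(rollouts, limits):
    """rollouts: (n_rollouts, horizon, n_constraints) constrained outputs.
    limits: (n_constraints,) upper bounds. Returns the empirical probability
    that any constraint is violated at any time step of a rollout."""
    violated = (rollouts > limits).any(axis=(1, 2))   # per-rollout indicator
    return violated.mean()

rng = np.random.default_rng(0)
rollouts = rng.normal(0.0, 1.0, size=(500, 20, 2))    # toy closed-loop trajectories
alpha = 0.05                                          # allowed violation probability
p_viol = joint_violation_probability(rollouts, limits=np.array([3.0, 3.0]))
print(p_viol, "satisfied" if p_viol <= alpha else "violated")
```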
arXiv Detail & Related papers (2020-07-30T14:20:35Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
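The proximity term acts on discounted state-action visitation distributions. A toy estimate from trajectories, with a total-variation penalty standing in for the paper's actual regularizer, might look like this (all names and numbers are illustrative).

```python
import numpy as np

# Estimate discounted state-action visitation frequencies from rollouts and
# measure a proximity term between two policies (total variation for illustration).
def discounted_visitation(trajectories, n_states, n_actions, gamma=0.99):
    d = np.zeros((n_states, n_actions))
    for traj in trajectories:                       # traj: list of (state, action)
        for t, (s, a) in enumerate(traj):
            d[s, a] += (1 - gamma) * gamma ** t     # discounted occupancy weight
    return d / len(trajectories)

def proximity_penalty(d_old, d_new):
    return 0.5 * np.abs(d_old - d_new).sum()        # total-variation distance

trajs_old = [[(0, 1), (1, 0), (2, 1)], [(0, 0), (2, 1), (1, 1)]]
trajs_new = [[(0, 1), (2, 1), (2, 0)], [(0, 0), (1, 1), (1, 1)]]
d_old = discounted_visitation(trajs_old, n_states=3, n_actions=2)
d_new = discounted_visitation(trajs_new, n_states=3, n_actions=2)
print(proximity_penalty(d_old, d_new))
```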
arXiv Detail & Related papers (2020-03-09T13:05:47Z)