Feasible Policy Iteration
- URL: http://arxiv.org/abs/2304.08845v2
- Date: Sun, 28 Jan 2024 10:13:02 GMT
- Title: Feasible Policy Iteration
- Authors: Yujie Yang, Zhilong Zheng, Shengbo Eben Li, Jingliang Duan, Jingjing
Liu, Xianyuan Zhan, Ya-Qin Zhang
- Abstract summary: We propose an indirect safe RL framework called feasible policy iteration.
It guarantees that the feasible region monotonically expands and converges to the maximum one.
Experiments show that our algorithm learns strictly safe and near-optimal policies with accurate feasible regions.
- Score: 28.29623882912745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safe reinforcement learning (RL) aims to find the optimal policy and its
feasible region in a constrained optimal control problem (OCP). Ensuring
feasibility and optimality simultaneously has been a major challenge. Existing
methods either attempt to solve OCPs directly with constrained optimization
algorithms, leading to unstable training processes and unsatisfactory
feasibility, or restrict policies in overly small feasible regions, resulting
in excessive conservativeness with sacrificed optimality. To address this
challenge, we propose an indirect safe RL framework called feasible policy
iteration, which guarantees that the feasible region monotonically expands and
converges to the maximum one, and the state-value function monotonically
improves and converges to the optimal one. We achieve this by designing a
policy update principle called region-wise policy improvement, which maximizes
the state-value function under the constraint of the constraint decay function
(CDF) inside the feasible region and minimizes the CDF outside the feasible
region simultaneously. This update scheme ensures that the state-value function
monotonically increases state-wise in the feasible region and the CDF
monotonically decreases state-wise in the entire state space. We prove that the
CDF converges to the solution of the risky Bellman equation while the
state-value function converges to the solution of the feasible Bellman
equation. The former represents the maximum feasible region and the latter
manifests the optimal state-value function. Experiments show that our algorithm
learns strictly safe and near-optimal policies with accurate feasible regions
on classic control tasks. It also achieves fewer constraint violations with
performance better than (or comparable to) baselines on Safety Gym.
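The abstract's update principle can be made concrete with a small tabular sketch. The code below is a minimal illustration of region-wise policy improvement under several assumptions that are not taken from the paper: a toy deterministic chain MDP, tabular estimates V and F of the state-value function and the constraint decay function (CDF), and the reading that acting "under the constraint of the CDF" means keeping the successor state's CDF non-positive. It is a sketch of the idea, not the authors' implementation.

```python
# Illustrative sketch of region-wise policy improvement (abstract-level reading).
# The MDP, reward, and CDF estimates below are assumptions for illustration only.

GAMMA = 0.9

# Toy deterministic chain MDP: states 0..4, actions move left/right, state 4 is unsafe.
STATES = [0, 1, 2, 3, 4]
ACTIONS = [-1, +1]  # move left / move right

def step(s, a):
    """Deterministic transition: move and clip to the state space."""
    return min(max(s + a, 0), 4)

def reward(s, a):
    """Moving right earns more reward, tempting the agent toward the unsafe state."""
    return float(step(s, a))

def region_wise_policy_improvement(V, F):
    """One region-wise improvement step.

    Inside the current feasible region (F(s) <= 0): maximize the one-step
    return estimate subject to the CDF constraint on the successor state.
    Outside the feasible region (F(s) > 0): minimize the successor's CDF,
    which drives the state back toward feasibility.
    """
    policy = {}
    for s in STATES:
        if F[s] <= 0.0:
            safe_actions = [a for a in ACTIONS if F[step(s, a)] <= 0.0]
            # Fallback to all actions if no safe action exists (illustrative choice).
            candidates = safe_actions if safe_actions else ACTIONS
            policy[s] = max(candidates,
                            key=lambda a: reward(s, a) + GAMMA * V[step(s, a)])
        else:
            policy[s] = min(ACTIONS, key=lambda a: F[step(s, a)])
    return policy

if __name__ == "__main__":
    # Illustrative current estimates: state 4 violates the constraint (F > 0).
    V = {s: 0.0 for s in STATES}
    F = {s: (1.0 if s == 4 else 0.0) for s in STATES}
    print(region_wise_policy_improvement(V, F))
```

In this toy run, feasible states ignore actions whose successors have F > 0, while the single infeasible state picks the action that most reduces its CDF, mirroring the "maximize the value function inside, minimize the CDF outside" principle described above.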
Related papers
- Embedding Safety into RL: A New Take on Trust Region Methods [1.5733417396701983]
Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to unsafe behaviors.
We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints.
arXiv Detail & Related papers (2024-11-05T09:55:50Z) - Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued; interpolating between discontinuous local optima inevitably leads to constraint violations.
We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
arXiv Detail & Related papers (2024-03-19T15:54:38Z) - Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z) - Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states and more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z) - Provably Convergent Policy Optimization via Metric-aware Trust Region Methods [21.950484108431944]
Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein policy optimization (WPO) and Sinkhorn policy optimization (SPO).
We show that WPO guarantees a monotonic performance improvement and that SPO provably converges to WPO as the entropic regularizer diminishes.
arXiv Detail & Related papers (2023-06-25T05:41:38Z) - Hallucinated Adversarial Control for Conservative Offline Policy Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z) - Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violation of certain constraints.
This is the first analysis of an SRL algorithm with provable convergence to a globally optimal policy.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Chance Constrained Policy Optimization for Process Control and Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-07-30T14:20:35Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (or of any information it contains) and is not responsible for any consequences of its use.