Embedding Safety into RL: A New Take on Trust Region Methods
- URL: http://arxiv.org/abs/2411.02957v2
- Date: Tue, 04 Feb 2025 11:16:42 GMT
- Title: Embedding Safety into RL: A New Take on Trust Region Methods
- Authors: Nikola Milosevic, Johannes Müller, Nico Scherf
- Abstract summary: We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes policy space to ensure trust regions contain only safe policies.
Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
- Abstract: Reinforcement Learning (RL) agents can solve diverse tasks but often exhibit unsafe behavior. Constrained Markov Decision Processes (CMDPs) address this by enforcing safety constraints, yet existing methods either sacrifice reward maximization or allow unsafe training. We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the policy space geometry to ensure trust regions contain only safe policies, guaranteeing constraint satisfaction throughout training. We analyze its theoretical properties and connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
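For orientation, the constrained problem and the trust-region property described in the abstract can be written in standard CMDP notation. This is only a sketch: the symbols $V_r$ (expected return), $V_c$ (expected cost), $b$ (cost budget), $\delta$ (trust-region radius), $A_r^{\pi_k}$ (reward advantage) and the generic divergence $D$ are assumptions introduced here for illustration and are not taken from the paper. The specific divergence that C-TRPO uses is not stated in this summary.

```latex
% CMDP: maximize return subject to a bound on expected cost (assumed notation)
\begin{align}
  \max_{\pi} \; V_r(\pi)
  \quad &\text{s.t.} \quad V_c(\pi) \le b \\
% Generic trust-region step around the current policy \pi_k
  \pi_{k+1} \in \arg\max_{\pi} \;
  \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\!\left[ A_r^{\pi_k}(s,a) \right]
  \quad &\text{s.t.} \quad D(\pi, \pi_k) \le \delta \\
% "Trust regions contain only safe policies" read as a set inclusion
  \{\pi : D(\pi, \pi_k) \le \delta\}
  \;&\subseteq\; \{\pi : V_c(\pi) \le b\}
\end{align}
```

Read this way, if the last inclusion holds at every iteration, each update stays feasible, which is one way to interpret the abstract's claim of constraint satisfaction throughout training.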
Related papers
- Flipping-based Policy for Chance-Constrained Markov Decision Processes [9.404184937255694]
This paper proposes a flipping-based policy for Chance-Constrained Markov Decision Processes (CCMDPs).
The flipping-based policy selects the next action by tossing a potentially distorted coin between two action candidates.
We demonstrate that the flipping-based policy can improve the performance of the existing safe RL algorithms under the same limits of safety constraints.
arXiv Detail & Related papers (2024-10-09T02:00:39Z) - Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued; interpolating between discontinuous local optima inevitably leads to constraint violations.
We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
arXiv Detail & Related papers (2024-03-19T15:54:38Z) - Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z) - Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z) - Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z) - Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques.
We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z) - Safe Reinforcement Learning Using Advantage-Based Intervention [45.79740561754542]
Many sequential decision problems involve finding a policy that maximizes total reward while obeying safety constraints.
We propose a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training.
Our method comes with strong guarantees on safety during both training and deployment.
arXiv Detail & Related papers (2021-06-16T20:28:56Z) - Lyapunov Barrier Policy Optimization [15.364174084072872]
We propose a new method, LBPO, that uses a Lyapunov-based barrier function to restrict the policy update to a safe set for each training iteration.
Our method also allows the user to control the conservativeness of the agent with respect to the constraints in the environment.
arXiv Detail & Related papers (2021-03-16T17:58:27Z) - Cautious Reinforcement Learning with Logical Constraints [78.96597639789279]
An adaptive safe padding forces Reinforcement Learning (RL) to synthesise optimal control policies while ensuring safety during the learning process.
Theoretical guarantees are available on the optimality of the synthesised policies and on the convergence of the learning algorithm.
arXiv Detail & Related papers (2020-02-26T00:01:08Z) - Representation of Reinforcement Learning Policies in Reproducing Kernel Hilbert Spaces [72.5149277196468]
This framework involves finding a low-dimensional embedding of the policy on a reproducing kernel Hilbert space (RKHS).
We derive strong theoretical guarantees on the expected return of the reconstructed policy.
The results confirm that the policies can be robustly embedded in a low-dimensional space while the embedded policy incurs almost no decrease in return.
arXiv Detail & Related papers (2020-02-07T15:57:57Z)