Related papers: Embedding Safety into RL: A New Take on Trust Region Methods

Embedding Safety into RL: A New Take on Trust Region Methods

URL: http://arxiv.org/abs/2411.02957v1
Date: Tue, 05 Nov 2024 09:55:50 GMT
Title: Embedding Safety into RL: A New Take on Trust Region Methods
Authors: Nikola Milosevic, Johannes Müller, Nico Scherf,
Abstract summary: Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to unsafe behaviors. We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints.
Score: 1.5733417396701983
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to producing unsafe behaviors. Constrained Markov Decision Processes (CMDPs) provide a popular framework for incorporating safety constraints. However, common solution methods often compromise reward maximization by being overly conservative or allow unsafe behavior during training. We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints and yields trust regions composed exclusively of safe policies, ensuring constraint satisfaction throughout training. We theoretically study the convergence and update properties of C-TRPO and highlight connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Finally, we demonstrate experimentally that C-TRPO significantly reduces constraint violations while achieving competitive reward maximization compared to state-of-the-art CMDP algorithms.

Related papers

Safety-Biased Policy Optimisation: Towards Hard-Constrained Reinforcement Learning via Trust Regions [7.419036996978718]
Reinforcement learning (RL) in safety-critical domains requires agents to maximise rewards while strictly adhering to safety constraints.<n>We propose Safety-Biased Trust Region Policy optimisation (SB-TRPO), a new trust-region algorithm for hard-constrained RL.
arXiv Detail & Related papers (2025-12-29T07:15:07Z)
Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning [26.75757359001632]
We propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety.<n>EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion.<n>Our theoretical analysis demonstrates that, under mild assumptions, strategies trained via EPO achieve performance comparable to optimal solutions with global constraint violations strictly remaining within a prescribed bound.
arXiv Detail & Related papers (2025-11-06T07:51:58Z)
SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL [54.022106606140774]
We present theoretical results that provide a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setup. We also present SPoRt, which enables the user to trade off safety guarantees in exchange for task-specific performance.
arXiv Detail & Related papers (2025-04-08T19:09:07Z)
Probabilistic Shielding for Safe Reinforcement Learning [51.35559820893218]
In real-life scenarios, a Reinforcement Learning (RL) agent must often also behave in a safe manner, including at training time. We present a new, scalable method, which enjoys strict formal guarantees for Safe RL. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time.
arXiv Detail & Related papers (2025-03-09T17:54:33Z)
Flipping-based Policy for Chance-Constrained Markov Decision Processes [9.404184937255694]
This paper proposes a textitflipping-based policy for Chance-Constrained Markov Decision Processes ( CCMDPs) The flipping-based policy selects the next action by tossing a potentially distorted coin between two action candidates. We demonstrate that the flipping-based policy can improve the performance of the existing safe RL algorithms under the same limits of safety constraints.
arXiv Detail & Related papers (2024-10-09T02:00:39Z)
Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-ite convergence guarantees under (weak) gradient domination assumptions. We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued, interpolating between discontinuous local optima can inevitably lead to constraint violations. We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL. We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
arXiv Detail & Related papers (2024-03-19T15:54:38Z)
Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy. We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee. We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree) explicit as it is free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
Safety-Constrained Policy Transfer with Successor Features [19.754549649781644]
We propose a Constrained Markov Decision Process (CMDP) formulation that enables the transfer of policies and adherence to safety constraints. Our approach relies on a novel extension of generalized policy improvement to constrained settings via a Lagrangian formulation. Our experiments in simulated domains show that our approach is effective; it visits unsafe states less frequently and outperforms alternative state-of-the-art methods when taking safety constraints into account.
arXiv Detail & Related papers (2022-11-10T06:06:36Z)
Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques. We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z)
Safe Reinforcement Learning Using Advantage-Based Intervention [45.79740561754542]
Many sequential decision problems involve finding a policy that maximizes total reward while obeying safety constraints. We propose a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training. Our method comes with strong guarantees on safety during both training and deployment.
arXiv Detail & Related papers (2021-06-16T20:28:56Z)
Lyapunov Barrier Policy Optimization [15.364174084072872]
We propose a new method, LBPO, that uses a Lyapunov-based barrier function to restrict the policy update to a safe set for each training iteration. Our method also allows the user to control the conservativeness of the agent with respect to the constraints in the environment.
arXiv Detail & Related papers (2021-03-16T17:58:27Z)
CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints. This is the first-time analysis of SRL algorithms with global optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL) We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
Representation of Reinforcement Learning Policies in Reproducing Kernel Hilbert Spaces [72.5149277196468]
This framework involves finding a low-dimensional embedding of the policy on a kernel Hilbert space (RKHS) We derive strong theoretical guarantees on the expected return of the reconstructed policy. The results confirm that the policies can be robustly embedded in a low-dimensional space while the embedded policy incurs almost no decrease in return.
arXiv Detail & Related papers (2020-02-07T15:57:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.