SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL
- URL: http://arxiv.org/abs/2504.06386v2
- Date: Mon, 23 Jun 2025 10:50:00 GMT
- Title: SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL
- Authors: Jacques Cloete, Nikolaus Vertovec, Alessandro Abate
- Abstract summary: We present theoretical results that place a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setting. This bound can be applied to temporally-extended properties (beyond safety) and to robust control problems. We present experimental results demonstrating this trade-off and comparing the theoretical bound to posterior bounds derived from empirical violation rates.
- Score: 54.022106606140774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To apply reinforcement learning to safety-critical applications, we ought to provide safety guarantees during both policy training and deployment. In this work, we present theoretical results that place a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setting. This bound, based on a maximum policy ratio computed with respect to a 'safe' base policy, can also be applied to temporally-extended properties (beyond safety) and to robust control problems. To utilize these results, we introduce SPoRt, which provides a data-driven method for computing this bound for the base policy using the scenario approach, and includes Projected PPO, a new projection-based approach for training the task-specific policy while maintaining a user-specified bound on property violation. SPoRt thus enables users to trade off safety guarantees against task-specific performance. Complementing our theoretical results, we present experimental results demonstrating this trade-off and comparing the theoretical bound to posterior bounds derived from empirical violation rates.
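As a rough, hedged illustration of how a maximum policy ratio can translate into a violation bound, the sketch below assumes the base policy's violation probability is at most eps_base (e.g. certified via the scenario approach) and that the task policy never exceeds a per-step ratio rho against the base policy over an H-step episode; the function and the exact form of the bound are illustrative and need not match the paper's result.
```python
# Hypothetical sketch, not the paper's exact bound: under the assumptions above,
# the likelihood ratio of any H-step trajectory is at most rho**H, so a simple
# change of measure inflates the base policy's violation probability accordingly.

def ratio_violation_bound(eps_base: float, rho: float, horizon: int) -> float:
    """Upper bound on the task policy's violation probability.

    eps_base: certified violation probability of the 'safe' base policy
    rho:      maximum per-step ratio max_{s,a} pi_task(a|s) / pi_base(a|s)
    horizon:  episode length H
    """
    return min(1.0, rho ** horizon * eps_base)


# Example: base policy certified at 0.1% violation, ratio bound 1.02, H = 100.
print(ratio_violation_bound(1e-3, 1.02, 100))  # ~= 0.0072
```
A Projected-PPO-style training step would then project the updated task policy back whenever the maximum ratio would exceed the value implied by the user's violation budget.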
Related papers
- Reinforcement Learning with Continuous Actions Under Unmeasured Confounding [14.510042451844766]
This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces. We develop a minimax estimator and introduce a policy-gradient-based algorithm to identify the in-class optimal policy. We provide theoretical results regarding the consistency, finite-sample error bound, and regret bound of the resulting optimal policy.
arXiv Detail & Related papers (2025-05-01T04:55:29Z) - Learning Verifiable Control Policies Using Relaxed Verification [49.81690518952909]
This work proposes to perform verification throughout training, aiming for policies whose properties can be evaluated at runtime. The approach is to use differentiable reachability analysis and to incorporate new components into the loss function.
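A minimal sketch, assuming interval bound propagation (IBP) through a small policy network and a hypothetical unsafe threshold on the output, of what folding a differentiable reachability term into the training loss can look like; the paper's reachability analysis and loss components may differ.
```python
import torch
import torch.nn as nn


def interval_linear(layer: nn.Linear, lo: torch.Tensor, hi: torch.Tensor):
    """Propagate an axis-aligned box [lo, hi] through a linear layer."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    new_center = layer(center)
    new_radius = radius @ layer.weight.abs().T
    return new_center - new_radius, new_center + new_radius


policy = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))

# Box of initial states; the unsafe region is (hypothetically) output > 0.8.
lo, hi = torch.tensor([[-0.1, -0.1]]), torch.tensor([[0.1, 0.1]])
l1, u1 = interval_linear(policy[0], lo, hi)
l2, u2 = torch.tanh(l1), torch.tanh(u1)        # tanh is monotone, so bounds pass through
l3, u3 = interval_linear(policy[2], l2, u2)

reach_penalty = torch.relu(u3 - 0.8).sum()     # differentiable violation of the bound
task_loss = torch.tensor(0.0)                  # placeholder for the usual RL objective
(task_loss + 10.0 * reach_penalty).backward()  # verification term shapes the gradients
```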
arXiv Detail & Related papers (2025-04-23T16:54:35Z) - Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning [7.888219789657414]
In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints. We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders. We frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints.
arXiv Detail & Related papers (2024-12-11T22:00:07Z) - Embedding Safety into RL: A New Take on Trust Region Methods [1.5733417396701983]
We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes policy space to ensure trust regions contain only safe policies. Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
arXiv Detail & Related papers (2024-11-05T09:55:50Z) - Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios the feasible policy should be discontinuous or multi-valued; interpolating between discontinuous local optima can inevitably lead to constraint violations.
We are the first to identify the mechanism generating this phenomenon, and we employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
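A minimal sketch of a Gaussian-mixture policy head of the kind described above, using torch.distributions; the class name, network sizes, and component count are illustrative assumptions, not the paper's architecture.
```python
import torch
import torch.nn as nn
import torch.distributions as D


class MixturePolicy(nn.Module):
    """Multimodal policy: a categorical mixture of diagonal Gaussians over actions."""

    def __init__(self, obs_dim: int, act_dim: int, n_components: int = 3):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.logits = nn.Linear(64, n_components)               # mixture weights
        self.means = nn.Linear(64, n_components * act_dim)      # component means
        self.log_std = nn.Parameter(torch.zeros(n_components, act_dim))
        self.n, self.act_dim = n_components, act_dim

    def dist(self, obs: torch.Tensor) -> D.MixtureSameFamily:
        h = self.body(obs)
        mix = D.Categorical(logits=self.logits(h))
        means = self.means(h).view(-1, self.n, self.act_dim)
        comp = D.Independent(D.Normal(means, self.log_std.exp()), 1)
        return D.MixtureSameFamily(mix, comp)


policy = MixturePolicy(obs_dim=4, act_dim=2)
obs = torch.randn(8, 4)
dist = policy.dist(obs)
actions = dist.sample()          # multimodal actions, shape (8, 2)
logp = dist.log_prob(actions)    # usable in a policy-gradient objective
```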
arXiv Detail & Related papers (2024-03-19T15:54:38Z) - Compositional Policy Learning in Stochastic Control Systems with Formal Guarantees [0.0]
Reinforcement learning has shown promising results in learning neural network policies for complicated control tasks.
We propose a novel method for learning a composition of neural network policies in stochastic environments.
A formal certificate guarantees that a specification over the policy's behavior is satisfied with the desired probability.
arXiv Detail & Related papers (2023-12-03T17:04:18Z) - Offline Reinforcement Learning with On-Policy Q-Function Regularization [57.09073809901382]
We deal with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy.
We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
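A hedged sketch of one way an estimated Q-function can act as a regularizer in offline RL, assuming a behaviour-policy critic q_behavior fitted on the logged data; the two algorithms actually proposed in the paper are not reproduced here.
```python
import torch
import torch.nn.functional as F


def critic_loss(q_net, q_behavior, batch, gamma=0.99, alpha=1.0):
    """TD loss plus a pull toward the behaviour-policy Q on logged pairs."""
    obs, act, rew, next_obs, next_act, done = batch
    with torch.no_grad():
        # SARSA-style target built from logged next actions (stays in-distribution).
        target = rew + gamma * (1.0 - done) * q_net(next_obs, next_act)
    td_loss = F.mse_loss(q_net(obs, act), target)
    # Regularizer: keep the learned critic close to the behaviour Q estimate,
    # which limits extrapolation error outside the dataset's support.
    reg = F.mse_loss(q_net(obs, act), q_behavior(obs, act).detach())
    return td_loss + alpha * reg
```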
arXiv Detail & Related papers (2023-07-25T21:38:08Z) - Probabilistic Constraint for Safety-Critical Reinforcement Learning [13.502008069967552]
We consider the problem of learning safe policies for probabilistic-constrained reinforcement learning (RL).
We provide an improved gradient estimator, SPG-Actor-Critic, that leads to lower variance than SPG-REINFORCE.
We propose a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe policies.
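A minimal sketch of the primal-dual mechanics such an algorithm builds on: the policy minimizes a Lagrangian while a multiplier rises whenever the estimated violation probability exceeds the user's tolerance; the SPG gradient estimators themselves are not reproduced.
```python
import torch


def lagrangian_step(policy_objective, violation_prob, lam, delta, lr_lambda=1e-2):
    """One primal-dual iteration.

    policy_objective: negative expected return (scalar tensor to minimize)
    violation_prob:   estimated probability of violating the safety constraint
    lam:              Lagrange multiplier (1-element tensor, kept non-negative)
    delta:            tolerated violation probability
    """
    # Primal objective: trade return against the probabilistic constraint.
    lagrangian = policy_objective + lam.detach() * (violation_prob - delta)

    # Dual ascent: raise the multiplier while the constraint is violated.
    with torch.no_grad():
        lam.add_(lr_lambda * (violation_prob.detach() - delta)).clamp_(min=0.0)

    return lagrangian  # backpropagate this through the policy parameters


# Example wiring (illustrative):
lam = torch.zeros(1)
# loss = lagrangian_step(-expected_return, violation_estimate, lam, delta=0.05)
# loss.backward(); optimizer.step()
```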
arXiv Detail & Related papers (2023-06-29T19:41:56Z) - Trust Region-Based Safe Distributional Reinforcement Learning for Multiple Constraints [18.064813206191754]
We propose a trust region-based safe reinforcement learning algorithm for multiple constraints called the safe distributional actor-critic (SDAC).
Our main contributions are as follows: 1) introducing a gradient integration method to manage infeasibility issues in multi-constrained problems, ensuring theoretical convergence, and 2) developing a TD($\lambda$) target distribution to estimate risk-averse constraints with low bias.
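For reference, a minimal sketch of a scalar TD($\lambda$) target computation; the paper's method estimates a full target distribution for risk-averse constraints, which this sketch does not attempt.
```python
def td_lambda_targets(rewards, next_values, gamma=0.99, lam=0.95):
    """Scalar lambda-returns for one episode (no terminal-state handling).

    rewards:     r_0 .. r_{T-1}
    next_values: V(s_1) .. V(s_T), value estimates for the successor states
    """
    targets = [0.0] * len(rewards)
    g = next_values[-1]  # bootstrap from the final state's value
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the longer lambda-return.
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        targets[t] = g
    return targets


print(td_lambda_targets([1.0, 0.0, 1.0], [0.5, 0.7, 0.2]))
```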
arXiv Detail & Related papers (2023-01-26T04:05:40Z) - Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z) - Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques.
We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
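A minimal sketch of trajectory-wise importance sampling, the standard building block behind such off-policy estimates; the robust and optimistic confidence machinery of the paper is not reproduced, and target_logp / behavior_logp are assumed callables returning log pi(a|s).
```python
import math


def is_estimate(trajectories, target_logp, behavior_logp, gamma=0.99):
    """Trajectory-wise importance-sampling estimate of the target policy's return.

    trajectories: iterable of episodes, each a list of (state, action, reward)
                  tuples collected under the behaviour policy.
    """
    total = 0.0
    for traj in trajectories:
        log_ratio, ret = 0.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            log_ratio += target_logp(s, a) - behavior_logp(s, a)  # per-step ratio
            ret += (gamma ** t) * r                               # discounted return
        total += math.exp(log_ratio) * ret                        # reweighted return
    return total / len(trajectories)
```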
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation [21.703965401500913]
We propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning.
In particular, we have three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk-averse implementation tailored to the application context, and 3) we propose a way to interpret ESRL's policy at every state through posterior distributions.
arXiv Detail & Related papers (2020-06-23T17:43:44Z) - Cautious Reinforcement Learning with Logical Constraints [78.96597639789279]
An adaptive safe padding forces Reinforcement Learning (RL) to synthesise optimal control policies while ensuring safety during the learning process.
Theoretical guarantees are available on the optimality of the synthesised policies and on the convergence of the learning algorithm.
arXiv Detail & Related papers (2020-02-26T00:01:08Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation, jointly maximizing a lower bound on policy performance.
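A hedged sketch of the residual-policy idea: the executed policy perturbs the behaviour policy's logits by a residual whose magnitude is capped by a learned, state-action-dependent allowance; the class and layer names are illustrative and BRPO's lower-bound objective is not reproduced.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualPolicy(nn.Module):
    """Behaviour-policy logits plus a bounded, learned residual (discrete actions)."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.residual = nn.Linear(obs_dim, n_actions)   # learned logit perturbation
        self.allowance = nn.Linear(obs_dim, n_actions)  # state-action-dependent cap

    def forward(self, obs: torch.Tensor, behavior_logits: torch.Tensor):
        cap = F.softplus(self.allowance(obs))          # non-negative deviation budget
        delta = torch.tanh(self.residual(obs)) * cap   # residual bounded by the cap
        return torch.distributions.Categorical(logits=behavior_logits + delta)


policy = ResidualPolicy(obs_dim=4, n_actions=3)
obs = torch.randn(5, 4)
behavior_logits = torch.zeros(5, 3)                    # e.g. a uniform behaviour policy
actions = policy(obs, behavior_logits).sample()        # shape (5,)
```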
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.