Safe Policy Improvement Approaches on Discrete Markov Decision Processes
- URL: http://arxiv.org/abs/2201.12175v1
- Date: Fri, 28 Jan 2022 15:16:54 GMT
- Title: Safe Policy Improvement Approaches on Discrete Markov Decision Processes
- Authors: Philipp Scholl, Felix Dietrich, Clemens Otte, Steffen Udluft
- Abstract summary: Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy.
We derive a new algorithm that is provably safe on finite Markov Decision Processes (MDP).
- Score: 2.596059386610301
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Safe Policy Improvement (SPI) aims at provable guarantees that a learned
policy is at least approximately as good as a given baseline policy. Building
on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we
identify theoretical issues in their approach, provide a corrected theory, and
derive a new algorithm that is provably safe on finite Markov Decision
Processes (MDP). Additionally, we provide a heuristic algorithm that exhibits
the best performance among many state-of-the-art SPI algorithms on two
different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms
and empirically show an interesting property of two classes of SPI algorithms:
while the mean performance of algorithms that incorporate the uncertainty as a
penalty on the action-value is higher, actively restricting the set of policies
more consistently produces good policies and is, thus, safer.
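To make the abstract's taxonomy concrete, below is a minimal sketch of the two classes of SPI algorithms on a toy finite MDP with a fixed batch of transition counts. It is an illustration only, not the paper's corrected Soft-SPIBB theory or its Adv.-Soft-SPIBB algorithm: the count-based uncertainty term, the threshold `N_MIN`, and all names are assumptions chosen for brevity.

```python
# Minimal sketch (not the paper's algorithms) contrasting the two SPI classes from the
# abstract, given action-value estimates and state-action visit counts from a fixed dataset.
# The 1/sqrt(n) uncertainty and the threshold N_MIN are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
q_hat = rng.normal(size=(n_states, n_actions))             # action-values estimated from data
counts = rng.integers(0, 20, size=(n_states, n_actions))   # visits of each (s, a) in the dataset
pi_b = np.full((n_states, n_actions), 1.0 / n_actions)     # baseline (behavior) policy
N_MIN = 5                                                   # fewer visits than this counts as "uncertain"

uncertainty = 1.0 / np.sqrt(np.maximum(counts, 1))          # crude count-based uncertainty

def penalty_based_policy(q_hat, uncertainty, kappa=1.0):
    """Class 1: act greedily on action-values penalized by their uncertainty."""
    q_pessimistic = q_hat - kappa * uncertainty
    pi = np.zeros_like(q_hat)
    pi[np.arange(q_hat.shape[0]), q_pessimistic.argmax(axis=1)] = 1.0
    return pi

def restriction_based_policy(q_hat, counts, pi_b, n_min=N_MIN):
    """Class 2 (SPIBB-style): keep the baseline probabilities on uncertain pairs and
    redistribute only the remaining mass among well-observed actions."""
    pi = np.zeros_like(pi_b)
    for s in range(pi_b.shape[0]):
        uncertain = counts[s] < n_min
        pi[s, uncertain] = pi_b[s, uncertain]          # policy is restricted to follow the baseline here
        known = np.flatnonzero(~uncertain)
        if known.size == 0:                            # no reliable action: fall back to the baseline
            pi[s] = pi_b[s]
        else:                                          # free mass goes to the best well-observed action
            pi[s, known[np.argmax(q_hat[s, known])]] = 1.0 - pi[s, uncertain].sum()
    return pi

print(penalty_based_policy(q_hat, uncertainty))
print(restriction_based_policy(q_hat, counts, pi_b))
```

The first function mirrors the "uncertainty as a penalty on the action-value" class with higher mean performance; the second mirrors the "restrict the set of policies" class that the abstract reports as more consistently producing good policies.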
Related papers
- Planning and Learning in Average Risk-aware MDPs [4.696083734269232]
We extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures.
Both the RVI and Q-learning algorithms are proven to converge to optimality.
Our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
arXiv Detail & Related papers (2025-03-22T03:18:09Z)
- SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization [1.3597551064547502]
This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization.
We define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints (a minimal sketch of this idea appears after the related-papers list below).
Our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards.
arXiv Detail & Related papers (2023-11-01T22:12:50Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Bayesian Safe Policy Learning with Chance Constrained Optimization: Application to Military Security Assessment during the Vietnam War [0.0]
We investigate whether it would have been possible to improve a security assessment algorithm employed during the Vietnam War.
This empirical application raises several methodological challenges that frequently arise in high-stakes algorithmic decision-making.
arXiv Detail & Related papers (2023-07-17T20:59:50Z)
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
- More for Less: Safe Policy Improvement With Stronger Performance Guarantees [7.507789621505201]
The safe policy improvement (SPI) problem aims to improve, with provable guarantees, upon the behavior policy according to which the sample data has been generated.
We present a novel approach to the SPI problem that requires less data to obtain such guarantees.
arXiv Detail & Related papers (2023-05-13T16:22:21Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Safe Policy Improvement Approaches and their Limitations [2.596059386610301]
We classify various Safe Policy Improvement (SPI) approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs.
We show that the claim that the Soft-SPIBB algorithms are provably safe does not hold.
We develop adaptations, the Adv.-Soft-SPIBB algorithms, and show that they are provably safe.
arXiv Detail & Related papers (2022-08-01T10:13:03Z)
- Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI approach for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs between different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Process (RMDP) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations": algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm.
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
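Referring back to the SCPO entry above, the following is a hypothetical sketch of the reward-nullification idea it describes, i.e., zeroing out reward gathered while a safety constraint is violated. It is not the SCPO algorithm itself; the gym-style `step()` interface, the `is_violation` predicate, and the `info["cost"]` convention are assumptions.

```python
# Hypothetical sketch of the "nullify rewards obtained through violating safety constraints"
# idea from the SCPO entry above; not the actual SCPO algorithm. The gym-style step()
# signature and the cost-based violation check are assumptions.
from typing import Any, Callable, Tuple

class RewardNullifyingWrapper:
    """Environment wrapper that sets the reward to zero on unsafe transitions."""

    def __init__(self, env: Any, is_violation: Callable[[dict], bool]):
        self.env = env
        self.is_violation = is_violation        # e.g. lambda info: info.get("cost", 0.0) > 0

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action) -> Tuple[Any, float, bool, bool, dict]:
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.is_violation(info):
            reward = 0.0                        # reward earned while violating the constraint is nullified
        return obs, reward, terminated, truncated, info
```

Any learner trained against the wrapped environment receives no reward for transitions flagged as unsafe, which is one literal reading of the mechanism described in that entry.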