Safe Policy Improvement Approaches on Discrete Markov Decision Processes
- URL: http://arxiv.org/abs/2201.12175v1
- Date: Fri, 28 Jan 2022 15:16:54 GMT
- Title: Safe Policy Improvement Approaches on Discrete Markov Decision Processes
- Authors: Philipp Scholl, Felix Dietrich, Clemens Otte, Steffen Udluft
- Abstract summary: Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy.
We derive a new algorithm that is provably safe on finite Markov Decision Processes (MDP).
- Score: 2.596059386610301
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Safe Policy Improvement (SPI) aims at provable guarantees that a learned
policy is at least approximately as good as a given baseline policy. Building
on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we
identify theoretical issues in their approach, provide a corrected theory, and
derive a new algorithm that is provably safe on finite Markov Decision
Processes (MDP). Additionally, we provide a heuristic algorithm that exhibits
the best performance among many state-of-the-art SPI algorithms on two
different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms
and empirically show an interesting property of two classes of SPI algorithms:
while the mean performance of algorithms that incorporate the uncertainty as a
penalty on the action-value is higher, actively restricting the set of policies
more consistently produces good policies and is, thus, safer.
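To make the abstract's taxonomy concrete, below is a minimal sketch of the two classes of SPI algorithms on a toy finite MDP with a fixed batch of transition counts. It is an illustration only, not the paper's corrected Soft-SPIBB theory or its Adv.-Soft-SPIBB algorithm: the count-based uncertainty term, the threshold `N_MIN`, and all names are assumptions chosen for brevity.

```python
# Minimal sketch (not the paper's algorithms) contrasting the two SPI classes from the
# abstract, given action-value estimates and state-action visit counts from a fixed dataset.
# The 1/sqrt(n) uncertainty and the threshold N_MIN are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
q_hat = rng.normal(size=(n_states, n_actions))             # action-values estimated from data
counts = rng.integers(0, 20, size=(n_states, n_actions))   # visits of each (s, a) in the dataset
pi_b = np.full((n_states, n_actions), 1.0 / n_actions)     # baseline (behavior) policy
N_MIN = 5                                                   # fewer visits than this counts as "uncertain"

uncertainty = 1.0 / np.sqrt(np.maximum(counts, 1))          # crude count-based uncertainty

def penalty_based_policy(q_hat, uncertainty, kappa=1.0):
    """Class 1: act greedily on action-values penalized by their uncertainty."""
    q_pessimistic = q_hat - kappa * uncertainty
    pi = np.zeros_like(q_hat)
    pi[np.arange(q_hat.shape[0]), q_pessimistic.argmax(axis=1)] = 1.0
    return pi

def restriction_based_policy(q_hat, counts, pi_b, n_min=N_MIN):
    """Class 2 (SPIBB-style): keep the baseline probabilities on uncertain pairs and
    redistribute only the remaining mass among well-observed actions."""
    pi = np.zeros_like(pi_b)
    for s in range(pi_b.shape[0]):
        uncertain = counts[s] < n_min
        pi[s, uncertain] = pi_b[s, uncertain]          # policy is restricted to follow the baseline here
        known = np.flatnonzero(~uncertain)
        if known.size == 0:                            # no reliable action: fall back to the baseline
            pi[s] = pi_b[s]
        else:                                          # free mass goes to the best well-observed action
            pi[s, known[np.argmax(q_hat[s, known])]] = 1.0 - pi[s, uncertain].sum()
    return pi

print(penalty_based_policy(q_hat, uncertainty))
print(restriction_based_policy(q_hat, counts, pi_b))
```

The first function mirrors the "uncertainty as a penalty on the action-value" class with higher mean performance; the second mirrors the "restrict the set of policies" class that the abstract reports as more consistently producing good policies.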
Related papers
- Planning and Learning in Average Risk-aware MDPs [4.696083734269232]
We extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures.
Both the RVI and Q-learning algorithms are proven to converge to optimality.
Our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
arXiv Detail & Related papers (2025-03-22T03:18:09Z)
- SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization [1.3597551064547502]
This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization.
We define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints (a minimal sketch of this idea appears after the related-papers list below).
Our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards.
arXiv Detail & Related papers (2023-11-01T22:12:50Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Bayesian Safe Policy Learning with Chance Constrained Optimization: Application to Military Security Assessment during the Vietnam War [0.0]
We investigate whether it would have been possible to improve a security assessment algorithm employed during the Vietnam War.
This empirical application raises several methodological challenges that frequently arise in high-stakes algorithmic decision-making.
arXiv Detail & Related papers (2023-07-17T20:59:50Z)
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
- More for Less: Safe Policy Improvement With Stronger Performance Guarantees [7.507789621505201]
The safe policy improvement (SPI) problem aims to improve, with provable guarantees, upon the behavior policy according to which the sample data has been generated.
We present a novel approach to the SPI problem that requires less data to obtain such guarantees.
arXiv Detail & Related papers (2023-05-13T16:22:21Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Safe Policy Improvement Approaches and their Limitations [2.596059386610301]
We classify various Safe Policy Improvement (SPI) approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs.
We show that the claim that the Soft-SPIBB algorithms are provably safe does not hold.
We develop adaptations, the Adv.-Soft-SPIBB algorithms, and show that they are provably safe.
arXiv Detail & Related papers (2022-08-01T10:13:03Z)
- Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI approach for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs between different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Process (RMDP) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations": algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm.
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
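Referring back to the SCPO entry above, the following is a hypothetical sketch of the reward-nullification idea it describes, i.e., zeroing out reward gathered while a safety constraint is violated. It is not the SCPO algorithm itself; the gym-style `step()` interface, the `is_violation` predicate, and the `info["cost"]` convention are assumptions.

```python
# Hypothetical sketch of the "nullify rewards obtained through violating safety constraints"
# idea from the SCPO entry above; not the actual SCPO algorithm. The gym-style step()
# signature and the cost-based violation check are assumptions.
from typing import Any, Callable, Tuple

class RewardNullifyingWrapper:
    """Environment wrapper that sets the reward to zero on unsafe transitions."""

    def __init__(self, env: Any, is_violation: Callable[[dict], bool]):
        self.env = env
        self.is_violation = is_violation        # e.g. lambda info: info.get("cost", 0.0) > 0

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action) -> Tuple[Any, float, bool, bool, dict]:
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.is_violation(info):
            reward = 0.0                        # reward earned while violating the constraint is nullified
        return obs, reward, terminated, truncated, info
```

Any learner trained against the wrapped environment receives no reward for transitions flagged as unsafe, which is one literal reading of the mechanism described in that entry.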