Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs
- URL: http://arxiv.org/abs/2106.00099v1
- Date: Mon, 31 May 2021 21:04:21 GMT
- Title: Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs
- Authors: Harsh Satija, Philip S. Thomas, Joelle Pineau, Romain Laroche
- Abstract summary: We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI formulation for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs between the different reward signals.
- Score: 71.47895794305883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of Safe Policy Improvement (SPI) under constraints in
the offline Reinforcement Learning (RL) setting. We consider the scenario
where: (i) we have a dataset collected under a known baseline policy, (ii)
multiple reward signals are received from the environment inducing as many
objectives to optimize. We present an SPI formulation for this RL setting that
takes into account the preferences of the algorithm's user for handling the
trade-offs for different reward signals while ensuring that the new policy
performs at least as well as the baseline policy along each individual
objective. We build on traditional SPI algorithms and propose a novel method
based on Safe Policy Iteration with Baseline Bootstrapping (SPIBB, Laroche et
al., 2019) that provides high probability guarantees on the performance of the
agent in the true environment. We show the effectiveness of our method on a
synthetic grid-world safety task as well as in a real-world critical care
context to learn a policy for the administration of IV fluids and vasopressors
to treat sepsis.
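To make the baseline-bootstrapping idea concrete, below is a minimal Python sketch of one Pi_b-SPIBB-style improvement step extended to several reward signals. The function name, the scalarisation by a user preference vector, and the `preference_weights` argument are illustrative assumptions; the paper's Multi-Objective SPIBB additionally requires the candidate policy to improve on the baseline along each objective individually, which this sketch does not enforce.

```python
import numpy as np

def spibb_improvement_step(q_values, pi_b, counts, n_wedge, preference_weights):
    """Sketch of a Pi_b-SPIBB-style greedy projection step (Laroche et al., 2019)
    with several reward signals scalarised by user preference weights.

    q_values:           (n_objectives, n_states, n_actions) estimated Q-values
    pi_b:               (n_states, n_actions) baseline policy
    counts:             (n_states, n_actions) state-action counts in the dataset
    n_wedge:            bootstrapping threshold N_wedge
    preference_weights: (n_objectives,) user trade-off weights (illustrative)
    """
    # Scalarise the per-objective Q-values with the preference weights.
    q_scalar = np.tensordot(preference_weights, q_values, axes=1)

    pi_new = np.zeros_like(pi_b)
    for s in range(pi_b.shape[0]):
        uncertain = counts[s] < n_wedge              # "bootstrapped" state-action pairs
        pi_new[s, uncertain] = pi_b[s, uncertain]    # copy the baseline where data is scarce
        free_mass = pi_b[s, ~uncertain].sum()        # probability mass that may be moved
        if free_mass > 0:
            safe_actions = np.flatnonzero(~uncertain)
            best = safe_actions[np.argmax(q_scalar[s, safe_actions])]
            pi_new[s, best] += free_mass             # act greedily where estimates are reliable
    return pi_new
```

Larger values of `n_wedge` keep the learned policy closer to the baseline, which tightens the high-probability improvement guarantee at the price of smaller potential gains.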
Related papers
- CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies [30.57323631122579]
We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising.
Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements.
We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level.
arXiv Detail & Related papers (2024-08-21T21:38:03Z)
- Offline Goal-Conditioned Reinforcement Learning for Safety-Critical Tasks with Recovery Policy [4.854443247023496]
Offline goal-conditioned reinforcement learning (GCRL) aims at solving goal-reaching tasks with sparse rewards from an offline dataset.
We propose a new method called Recovery-based Supervised Learning (RbSL) to accomplish safety-critical tasks with various goals.
arXiv Detail & Related papers (2024-03-04T05:20:57Z)
- Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning [9.341618348621662]
We aim to find the best-performing policy within a limited budget of online interactions.
We first study the major online RL exploration methods based on intrinsic rewards and UCB.
We then introduce an algorithm for planning to go out-of-distribution that avoids these issues.
arXiv Detail & Related papers (2023-10-09T13:47:05Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
- More for Less: Safe Policy Improvement With Stronger Performance Guarantees [7.507789621505201]
The safe policy improvement (SPI) problem aims to improve the performance of a behavior policy according to which sample data has been generated.
We present a novel approach to the SPI problem that provides the means to require less data for such guarantees.
arXiv Detail & Related papers (2023-05-13T16:22:21Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Fast Model-based Policy Search for Universal Policy Networks [45.44896435487879]
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics based reinforcement learning.
We propose a Gaussian Process-based prior learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
arXiv Detail & Related papers (2022-02-11T18:08:02Z)
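As a rough illustration of the GP-prior plus Bayesian-optimisation policy search described in the entry above, here is a hedged sketch; the `evaluate` callback, the Matern kernel, and the expected-improvement acquisition are assumptions made for the example, not the paper's exact components.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bo_policy_search(sim_params, sim_returns, candidate_params, n_steps, evaluate):
    """Fit a GP prior on (policy parameters, return) pairs gathered in simulation,
    then pick which candidate policy of a universal policy network to try next
    via an expected-improvement acquisition. `evaluate` is a user-supplied
    callback returning the observed return in the target environment."""
    X, y = list(sim_params), list(sim_returns)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

    for _ in range(n_steps):
        gp.fit(np.asarray(X), np.asarray(y))
        mu, sigma = gp.predict(np.asarray(candidate_params), return_std=True)
        best = max(y)
        z = (mu - best) / np.maximum(sigma, 1e-9)
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
        nxt = int(np.argmax(ei))
        X.append(candidate_params[nxt])
        y.append(evaluate(candidate_params[nxt]))
    return X[int(np.argmax(y))], gp                            # best parameters found
```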
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
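The alternating updates named in the last entry (policy parameters, Lagrange multiplier, Fenchel dual variable) can be illustrated in a toy bandit setting; the single-sample estimates, learning rates, and the bandit simplification below are assumptions, not the paper's deep actor-critic.

```python
import numpy as np

def variance_constrained_updates(reward_means, reward_stds, xi=1.0,
                                 n_iters=5000, lr_theta=0.05, lr_lam=0.01, seed=0):
    """Toy sketch: maximise E[R] subject to Var[R] <= xi with a softmax bandit policy.
    Uses the Fenchel dual of the square, (E[R])^2 = max_y (2*y*E[R] - y^2), so the
    per-sample Lagrangian-shaped reward is r - lam * (r^2 - 2*y*r)."""
    rng = np.random.default_rng(seed)
    n_actions = len(reward_means)
    theta = np.zeros(n_actions)      # softmax policy parameters
    lam, y = 0.0, 0.0                # Lagrange multiplier and Fenchel dual variable
    for _ in range(n_iters):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        a = rng.choice(n_actions, p=probs)
        r = rng.normal(reward_means[a], reward_stds[a])
        y += 0.05 * (r - y)                              # track E[R]: the dual optimum
        var_est = r ** 2 - y ** 2                        # crude one-sample variance estimate
        lam = max(0.0, lam + lr_lam * (var_est - xi))    # dual ascent on the constraint
        shaped = r - lam * (r ** 2 - 2.0 * y * r)        # Lagrangian-shaped reward
        grad_log = -probs                                # REINFORCE: grad of log pi(a)
        grad_log[a] += 1.0
        theta += lr_theta * shaped * grad_log            # policy gradient ascent
    return theta, lam
```

For example, `variance_constrained_updates([1.0, 0.9], [2.0, 0.1], xi=0.5)` should shift probability toward the lower-variance arm as the multiplier grows.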
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.