More for Less: Safe Policy Improvement With Stronger Performance Guarantees
- URL: http://arxiv.org/abs/2305.07958v1
- Date: Sat, 13 May 2023 16:22:21 GMT
- Title: More for Less: Safe Policy Improvement With Stronger Performance Guarantees
- Authors: Patrick Wienhöft, Marnix Suilen, Thiago D. Simão, Clemens Dubslaff, Christel Baier, Nils Jansen
- Abstract summary: The safe policy improvement (SPI) problem aims to improve the performance of a behavior policy according to which sample data has been generated.
We present a novel approach to the SPI problem that provides the means to require less data for such guarantees.
- Score: 7.507789621505201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In an offline reinforcement learning setting, the safe policy improvement
(SPI) problem aims to improve the performance of a behavior policy according to
which sample data has been generated. State-of-the-art approaches to SPI
require a high number of samples to provide practical probabilistic guarantees
on the improved policy's performance. We present a novel approach to the SPI
problem that provides the means to require less data for such guarantees.
Specifically, to prove the correctness of these guarantees, we devise implicit
transformations on the data set and the underlying environment model that serve
as theoretical foundations to derive tighter improvement bounds for SPI. Our
empirical evaluation, using the well-established SPI with baseline
bootstrapping (SPIBB) algorithm, on standard benchmarks shows that our method
indeed significantly reduces the sample complexity of the SPIBB algorithm.
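As a rough illustration of the SPIBB baseline used in the evaluation, the sketch below shows a count-based policy improvement step in a tabular setting: state-action pairs observed fewer than `n_wedge` times keep the behavior policy's probabilities, and only well-sampled pairs may receive additional probability mass. This is a minimal sketch of the general SPIBB idea; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def spibb_policy(Q_hat, pi_b, counts, n_wedge):
    """Count-based SPIBB-style improvement step (illustrative sketch).

    Q_hat:   (S, A) action values estimated from the offline data set
    pi_b:    (S, A) behavior policy probabilities
    counts:  (S, A) number of times each state-action pair occurs in the data
    n_wedge: count threshold; pairs seen fewer times are "bootstrapped",
             i.e. the improved policy copies the behavior policy on them
    """
    S, A = Q_hat.shape
    pi_new = np.zeros_like(pi_b)
    for s in range(S):
        bootstrapped = counts[s] < n_wedge
        # Keep the behavior probabilities wherever the data is too scarce.
        pi_new[s, bootstrapped] = pi_b[s, bootstrapped]
        free_mass = 1.0 - pi_new[s].sum()
        if (~bootstrapped).any():
            # Put the remaining mass on the best sufficiently sampled action.
            best = np.argmax(np.where(~bootstrapped, Q_hat[s], -np.inf))
            pi_new[s, best] += free_mass
        # If every action is bootstrapped, pi_new[s] already equals pi_b[s].
    return pi_new
```

Loosely speaking, tighter improvement bounds of the kind derived in the paper allow a smaller count threshold for the same confidence level, which is where the reduced sample complexity comes from.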
Related papers
- Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.
Current approaches typically address this issue through online sampling from the target policy.
We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
- Risk-Averse Certification of Bayesian Neural Networks [70.44969603471903]
We propose a Risk-Averse Certification framework for Bayesian neural networks called RAC-BNN.
Our method leverages sampling and optimisation to compute a sound approximation of the output set of a BNN.
We validate RAC-BNN on a range of regression and classification benchmarks and compare its performance with a state-of-the-art method.
arXiv Detail & Related papers (2024-11-29T14:22:51Z)
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for such statistical inference.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
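For readers unfamiliar with the estimator under study, here is a minimal sketch of TD(0) with linear function approximation and Polyak-Ruppert (iterate) averaging. It only illustrates the algorithm being analysed, not the paper's inference procedure; all names are made up.

```python
import numpy as np

def td_polyak(phi, rewards, next_phi, alpha=0.1, gamma=0.99):
    """TD(0) with linear function approximation and Polyak-Ruppert averaging.

    phi:      (T, d) feature vectors of the visited states
    rewards:  (T,)   observed rewards
    next_phi: (T, d) feature vectors of the successor states
    Returns the averaged iterate, around which statistical inference
    (confidence regions, etc.) is typically built.
    """
    T, d = phi.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(T):
        td_error = rewards[t] + gamma * next_phi[t] @ theta - phi[t] @ theta
        theta = theta + alpha * td_error * phi[t]
        theta_bar += (theta - theta_bar) / (t + 1)  # running Polyak-Ruppert average
    return theta_bar
```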
arXiv Detail & Related papers (2024-10-21T15:34:44Z)
- Towards the Flatter Landscape and Better Generalization in Federated Learning under Client-level Differential Privacy [67.33715954653098]
We propose a novel DPFL algorithm named DP-FedSAM, which leverages gradient perturbation to mitigate the negative impact of DP.
Specifically, DP-FedSAM integrates Sharpness-Aware Minimization (SAM) to generate locally flat models with better stability and robustness to weight perturbations.
To further reduce the magnitude of the random noise while achieving better performance, we propose DP-FedSAM-$top_k$ by adopting a local update sparsification technique.
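The sketch below is a heavily simplified local update in the spirit of the description above: a SAM ascent step towards a flat minimum, client-level clipping and Gaussian noise for differential privacy, and an optional top-k sparsification of the update. The exact ordering and noise calibration in DP-FedSAM may differ; every name and default value here is illustrative.

```python
import numpy as np

def dp_fedsam_local_step(params, grad_fn, rho=0.05, lr=0.1,
                         clip=1.0, sigma=1.0, k=None, rng=None):
    """One client-side update in the spirit of DP-FedSAM (illustrative sketch).

    params : 1-D parameter vector of the local model
    grad_fn: callable returning the local loss gradient at given parameters
    rho    : SAM neighbourhood radius (worst-case ascent step)
    clip   : clipping bound used for client-level differential privacy
    sigma  : noise multiplier for the Gaussian noise added to the clipped update
    k      : if set, keep only the k largest-magnitude update coordinates
             (the local update sparsification mentioned in the summary)
    """
    rng = rng or np.random.default_rng()
    g = grad_fn(params)
    # SAM: perturb the parameters towards the worst-case direction,
    # then use the gradient taken at the perturbed point.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sam = grad_fn(params + eps)
    update = -lr * g_sam
    # Client-level DP: clip the update and add calibrated Gaussian noise.
    update *= min(1.0, clip / (np.linalg.norm(update) + 1e-12))
    update = update + rng.normal(0.0, sigma * clip, size=update.shape)
    if k is not None:
        # Top-k sparsification: only k coordinates (and their noise) are kept.
        mask = np.zeros_like(update)
        mask[np.argsort(np.abs(update))[-k:]] = 1.0
        update *= mask
    return params + update
```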
arXiv Detail & Related papers (2023-05-01T15:19:09Z)
- Safe Policy Improvement for POMDPs via Finite-State Controllers [6.022036788651133]
We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs).
SPI methods require access neither to a model nor to the environment itself, and aim to reliably improve the behavior policy in an offline manner.
We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability.
arXiv Detail & Related papers (2023-01-12T11:22:54Z)
- MEET: A Monte Carlo Exploration-Exploitation Trade-off for Buffer Sampling [2.501153467354696]
State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent.
However, they do not incorporate uncertainty in the Q-value estimation.
This paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off.
arXiv Detail & Related papers (2022-10-24T18:55:41Z)
- Safe Policy Improvement Approaches and their Limitations [2.596059386610301]
We classify various Safe Policy Improvement (SPI) approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs.
We show that their claim of being provably safe does not hold.
We develop adaptations, the Adv.-Soft-SPIBB algorithms, and show that they are provably safe.
arXiv Detail & Related papers (2022-08-01T10:13:03Z)
- Safe Policy Improvement Approaches on Discrete Markov Decision Processes [2.596059386610301]
Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy.
We derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs).
arXiv Detail & Related papers (2022-01-28T15:16:54Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
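A minimal sketch of the importance-sampling weighted empirical risk described above: each logged loss is reweighted by the inverse probability with which the adaptive data-collection policy chose the logged action. The clipping floor and all names are illustrative additions, not the paper's exact estimator.

```python
import numpy as np

def is_weighted_erm_loss(losses, logging_probs, min_prob=1e-3):
    """Importance-sampling weighted empirical risk (illustrative sketch).

    losses:        (n,) per-sample losses of a candidate hypothesis
    logging_probs: (n,) probability with which the adaptive data-collection
                   policy selected the logged action in each round
    min_prob:      floor that keeps the weights bounded when exploration
                   decays towards zero
    """
    weights = 1.0 / np.clip(logging_probs, min_prob, 1.0)
    return float(np.mean(weights * losses))
```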
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI algorithm for this RL setting that takes into account the user's preferences for handling trade-offs between different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z)
- SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition for the effectiveness of the framework through a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a smooth policy that behaves smoothly with respect to states.
We develop a new framework -- Smooth Regularized Reinforcement Learning (SR$^2$L), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
arXiv Detail & Related papers (2020-03-21T00:10:29Z)
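To make the last entry concrete, below is a small sketch of a smoothness-inducing regularizer in the spirit of SR$^2$L: it penalizes large changes in the policy's output under small perturbations of the input state, and would be added to the usual RL objective. The random-sampling approximation of the worst-case perturbation and all names are assumptions, not the authors' implementation.

```python
import numpy as np

def smoothness_penalty(policy, states, eps=0.01, n_samples=4, rng=None):
    """Smoothness-inducing regularizer (illustrative sketch).

    policy(s) -> action-probability vector for state s (a NumPy array)
    The penalty is large when small perturbations of a state change the
    policy's output a lot; minimizing it encourages a policy that behaves
    smoothly with respect to states.
    """
    rng = rng or np.random.default_rng()
    total = 0.0
    for s in states:
        base = policy(s)
        worst = 0.0
        for _ in range(n_samples):  # crude stand-in for the worst-case perturbation
            s_pert = s + rng.uniform(-eps, eps, size=np.shape(s))
            worst = max(worst, float(np.sum((policy(s_pert) - base) ** 2)))
        total += worst
    return total / len(states)
```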