Towards Safe Policy Improvement for Non-Stationary MDPs
- URL: http://arxiv.org/abs/2010.12645v2
- Date: Thu, 17 Dec 2020 20:26:20 GMT
- Title: Towards Safe Policy Improvement for Non-Stationary MDPs
- Authors: Yash Chandak, Scott M. Jordan, Georgios Theocharous, Martha White,
Philip S. Thomas
- Abstract summary: Many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable.
We take the first steps towards ensuring safety, with high confidence, for smoothly-varying non-stationary decision problems.
Our proposed method extends a type of safe algorithm, called a Seldonian algorithm, through a synthesis of model-free reinforcement learning with time-series analysis.
- Score: 48.9966576179679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many real-world sequential decision-making problems involve critical systems
with financial risks and human-life risks. While several works in the past have
proposed methods that are safe for deployment, they assume that the underlying
problem is stationary. However, many real-world problems of interest exhibit
non-stationarity, and when stakes are high, the cost associated with a false
stationarity assumption may be unacceptable. We take the first steps towards
ensuring safety, with high confidence, for smoothly-varying non-stationary
decision problems. Our proposed method extends a type of safe algorithm, called
a Seldonian algorithm, through a synthesis of model-free reinforcement learning
with time-series analysis. Safety is ensured using sequential hypothesis
testing of a policy's forecasted performance, and confidence intervals are
obtained using wild bootstrap.
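The forecasting-plus-bootstrap idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes a policy's past performance estimates are already available as a list, fits a plain linear trend to them, and uses a simplified percentile wild bootstrap with Rademacher sign flips to get a one-sided lower bound on the forecast. The function names (`fit_line`, `wild_bootstrap_lower_bound`, `is_safe_to_deploy`), the linear trend, and the single (non-sequential) test are all illustrative assumptions.

```python
import random


def fit_line(ts, ys):
    """Ordinary least-squares fit of y = a + b * t; returns (a, b)."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    b = sxy / sxx
    return y_mean - b * t_mean, b


def wild_bootstrap_lower_bound(ys, horizon=1, alpha=0.05, n_boot=2000, seed=0):
    """Forecast performance `horizon` steps ahead.

    Returns (forecast, lower), where `lower` is an approximate
    (1 - alpha) one-sided lower bound from a percentile wild bootstrap.
    Assumption: performance varies smoothly enough for a linear trend.
    """
    rng = random.Random(seed)
    ts = list(range(len(ys)))
    a, b = fit_line(ts, ys)
    fitted = [a + b * t for t in ts]
    resid = [y - f for y, f in zip(ys, fitted)]
    t_future = len(ys) - 1 + horizon
    forecast = a + b * t_future

    boot = []
    for _ in range(n_boot):
        # Rademacher weights: flip the sign of each residual independently,
        # then refit the trend on the perturbed pseudo-observations.
        y_star = [f + r * rng.choice((-1.0, 1.0)) for f, r in zip(fitted, resid)]
        a_s, b_s = fit_line(ts, y_star)
        boot.append(a_s + b_s * t_future)

    boot.sort()
    lower = boot[int(alpha * n_boot)]
    return forecast, lower


def is_safe_to_deploy(ys_candidate, baseline_performance, **kwargs):
    """Accept the candidate only if the lower bound on its forecasted
    performance exceeds the (hypothetical) baseline performance."""
    _, lower = wild_bootstrap_lower_bound(ys_candidate, **kwargs)
    return lower > baseline_performance
```

In this sketch a candidate policy is rejected whenever the bootstrap lower bound does not clear the baseline, so uncertainty in the forecast translates into keeping the existing policy.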
Related papers
- Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models [57.006252510102506]
Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications.
We introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems.
arXiv Detail & Related papers (2026-02-12T22:03:35Z)
- Probabilistic Safety Guarantee for Stochastic Control Systems Using Average Reward MDPs [8.872171447378685]
We present a new algorithm that computes safe policies to determine the safety level across a finite state set.
This algorithm reduces the safety objective to the standard average reward Markov Decision Process (MDP) objective.
Results indicate that the average-reward MDPs solution is more comprehensive, converges faster, and offers higher quality compared to the minimum discounted-reward solution.
arXiv Detail & Related papers (2025-11-11T16:29:40Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts.
We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance.
Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time [0.6554326244334868]
We present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint.
We show that the learned policy is safe with high confidence.
We also demonstrate that efficient exploration can be achieved by defining a subset of the state-space called proxy set.
arXiv Detail & Related papers (2024-03-23T20:22:30Z)
- Information-Theoretic Safe Bayesian Optimization [59.758009422067005]
We consider a sequential decision making task, where the goal is to optimize an unknown function without evaluating parameters that violate an unknown (safety) constraint.
Most current methods rely on a discretization of the domain and cannot be directly extended to the continuous case.
We propose an information-theoretic safe exploration criterion that directly exploits the GP posterior to identify the most informative safe parameters to evaluate.
arXiv Detail & Related papers (2024-02-23T14:31:10Z)
- One step closer to unbiased aleatoric uncertainty estimation [71.55174353766289]
We propose a new estimation method by actively de-noising the observed data.
By conducting a broad range of experiments, we demonstrate that our proposed approach provides a much closer approximation to the actual data uncertainty than the standard method.
arXiv Detail & Related papers (2023-12-16T14:59:11Z)
- Information-Theoretic Safe Exploration with Gaussian Processes [89.31922008981735]
We consider a sequential decision making task where we are not allowed to evaluate parameters that violate an unknown (safety) constraint.
Most current methods rely on a discretization of the domain and cannot be directly extended to the continuous case.
We propose an information-theoretic safe exploration criterion that directly exploits the GP posterior to identify the most informative safe parameters to evaluate.
arXiv Detail & Related papers (2022-12-09T15:23:58Z)
- Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy.
arXiv Detail & Related papers (2022-08-23T05:02:09Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach called LBSGD is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing violation in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
- Context-Aware Safe Reinforcement Learning for Non-Stationary Environments [24.75527261989899]
Safety is a critical concern when deploying reinforcement learning agents for realistic tasks.
We propose the context-aware safe reinforcement learning (CASRL) method to realize safe adaptation in non-stationary environments.
Results show that the proposed algorithm significantly outperforms existing baselines in terms of safety and robustness.
arXiv Detail & Related papers (2021-01-02T23:52:22Z)
- Efficient falsification approach for autonomous vehicle validation using a parameter optimisation technique based on reinforcement learning [6.198523595657983]
The widescale deployment of Autonomous Vehicles (AV) appears to be imminent despite many safety challenges that are yet to be resolved.
The uncertainties in the behaviour of the traffic participants and the dynamic world cause reactions in advanced autonomous systems.
This paper presents an efficient falsification method to evaluate the System Under Test.
arXiv Detail & Related papers (2020-11-16T02:56:13Z)
- Verifiably Safe Exploration for End-to-End Reinforcement Learning [17.401496872603943]
This paper contributes a first approach toward enforcing formal safety constraints on end-to-end policies with visual inputs.
It is evaluated on a novel benchmark that emphasizes the challenge of safely exploring in the presence of hard constraints.
arXiv Detail & Related papers (2020-07-02T16:12:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.