Constrained Policy Optimization via Bayesian World Models
- URL: http://arxiv.org/abs/2201.09802v2
- Date: Tue, 25 Jan 2022 16:41:32 GMT
- Title: Constrained Policy Optimization via Bayesian World Models
- Authors: Yarden As, Ilnura Usmanova, Sebastian Curi, Andreas Krause
- Abstract summary: LAMBDA is a model-based approach for policy optimization in safety-critical tasks modeled via constrained Markov decision processes.
We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
- Score: 79.0077602277004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving sample efficiency and safety are crucial challenges when deploying reinforcement learning in high-stakes real-world applications. We propose LAMBDA, a novel model-based approach for policy optimization in safety-critical tasks modeled via constrained Markov decision processes. Our approach utilizes Bayesian world models, and harnesses the resulting uncertainty to maximize optimistic upper bounds on the task objective, as well as pessimistic upper bounds on the safety constraints. We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
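The abstract states the core mechanism only at a high level: draw several plausible world models from the Bayesian posterior, evaluate the policy optimistically with respect to the task return and pessimistically with respect to the safety cost, and optimize subject to the CMDP budget. The toy sketch below illustrates that idea with a small NumPy ensemble standing in for the posterior; every name (ToyModel, rollout, cost_budget, the single-gain linear policy) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch, assuming a toy 1-D system: posterior sampling of world models,
# imagined rollouts, and selection under an optimistic return / pessimistic cost rule.
import numpy as np

rng = np.random.default_rng(0)


class ToyModel:
    """One posterior sample of a 1-D dynamics model: x' = a*x + b*u + noise."""

    def __init__(self, a, b, safe_limit=1.0):
        self.a, self.b, self.safe_limit = a, b, safe_limit

    def step(self, x, u):
        x_next = self.a * x + self.b * u + 0.05 * rng.normal()
        reward = -x_next ** 2                        # task objective: drive the state to zero
        cost = float(abs(x_next) > self.safe_limit)  # indicator of a safety violation
        return x_next, reward, cost


def rollout(model, gain, horizon=20, x0=2.0):
    """Imagined rollout of the linear policy u = -gain * x inside one sampled model."""
    x, ret, cost = x0, 0.0, 0.0
    for _ in range(horizon):
        x, r, c = model.step(x, -gain * x)
        ret, cost = ret + r, cost + c
    return ret, cost


# Ensemble of posterior samples standing in for the Bayesian world model.
models = [ToyModel(a=rng.uniform(0.8, 1.1), b=rng.uniform(0.4, 0.6)) for _ in range(10)]

cost_budget = 2.0                       # assumed CMDP budget on cumulative cost
best_gain, best_return = None, -np.inf
for gain in np.linspace(0.0, 2.0, 21):  # crude policy search over a single gain parameter
    returns, costs = zip(*(rollout(m, gain) for m in models))
    optimistic_return = max(returns)    # optimistic bound on the objective across samples
    pessimistic_cost = max(costs)       # pessimistic (worst-case) bound on the cost
    if pessimistic_cost <= cost_budget and optimistic_return > best_return:
        best_gain, best_return = gain, optimistic_return

print(f"selected gain: {best_gain}, optimistic return bound: {best_return:.2f}")
```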
Related papers
- Enhanced Safety in Autonomous Driving: Integrating Latent State Diffusion Model for End-to-End Navigation [5.928213664340974]
This research addresses the safety issue in the control-optimization problem of autonomous driving.
We propose a novel, model-based approach for policy optimization, utilizing a conditional Value-at-Risk-based Soft Actor-Critic.
Our method introduces a worst-case actor to guide safe exploration, ensuring rigorous adherence to safety requirements even in unpredictable scenarios (a generic CVaR sketch follows this entry).
arXiv Detail & Related papers (2024-07-08T18:32:40Z)
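The autonomous-driving entry above relies on the conditional Value-at-Risk (CVaR) of the return distribution. The sketch below only illustrates the standard risk measure under its usual definition, estimated as the average of the worst alpha-fraction of Monte-Carlo return samples; it is generic and is not the paper's worst-case actor or Soft Actor-Critic training loop, and the sampled returns are synthetic.

```python
# Generic CVaR estimate from sampled returns; names and numbers are illustrative.
import numpy as np


def cvar(returns, alpha=0.1):
    """Average of the worst alpha-fraction of sampled returns (lower tail)."""
    returns = np.sort(np.asarray(returns, dtype=float))  # ascending: worst outcomes first
    k = max(1, int(np.ceil(alpha * len(returns))))       # size of the tail
    return returns[:k].mean()


rng = np.random.default_rng(0)
sampled_returns = rng.normal(loc=10.0, scale=3.0, size=1000)  # hypothetical rollout returns
print(f"mean return: {sampled_returns.mean():.2f}, CVaR_0.1: {cvar(sampled_returns):.2f}")
```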
- FOSP: Fine-tuning Offline Safe Policy through World Models [3.7971075341023526]
Model-based Reinforcement Learning (RL) has demonstrated high training efficiency and the ability to handle high-dimensional tasks.
However, prior works still face safety challenges due to online exploration during real-world deployment.
In this paper, we aim to further enhance safety during the deployment stage for vision-based robotic tasks by fine-tuning an offline-trained policy.
arXiv Detail & Related papers (2024-07-06T03:22:57Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Safe Policy Improvement in Constrained Markov Decision Processes [10.518340300810504]
We present a solution to the synthesis problem by solving its two main challenges: reward-shaping from a set of formal requirements and safe policy update.
For the former, we propose an automatic reward-shaping procedure, defining a scalar reward signal compliant with the task specification.
For the latter, we introduce an algorithm ensuring that the policy is improved in a safe fashion with high-confidence guarantees (a generic high-confidence improvement check is sketched after this entry).
arXiv Detail & Related papers (2022-10-20T13:29:32Z)
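The safe-policy-improvement entry above hinges on accepting a new policy only when the improvement holds with high confidence. One common way to instantiate such a check, not necessarily the paper's, is to compare a statistical lower confidence bound on the candidate policy's estimated return against the baseline; the sketch below uses a Hoeffding bound for per-episode returns in [0, r_max], with all names and numbers chosen purely for illustration.

```python
# Hedged sketch of a high-confidence improvement test (Hoeffding lower confidence bound).
import numpy as np


def accept_candidate(candidate_returns, baseline_return, delta=0.05, r_max=1.0):
    """Accept only if a (1 - delta) lower confidence bound on the candidate beats the baseline."""
    returns = np.asarray(candidate_returns, dtype=float)
    n = len(returns)
    lcb = returns.mean() - r_max * np.sqrt(np.log(1.0 / delta) / (2.0 * n))  # Hoeffding LCB
    return lcb >= baseline_return


rng = np.random.default_rng(1)
candidate_returns = rng.uniform(0.6, 1.0, size=500)  # hypothetical evaluation episodes
print(accept_candidate(candidate_returns, baseline_return=0.7))
```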
- Meta-Learning Priors for Safe Bayesian Optimization [72.8349503901712]
We build on a meta-learning algorithm, F-PACOH, capable of providing reliable uncertainty quantification in settings of data scarcity.
As a core contribution, we develop a novel framework for choosing safety-compliant priors in a data-driven manner.
On benchmark functions and a high-precision motion system, we demonstrate that our meta-learned priors accelerate the convergence of safe BO approaches.
arXiv Detail & Related papers (2022-10-03T08:38:38Z)
- Safe Reinforcement Learning in Constrained Markov Decision Processes [20.175139766171277]
We propose an algorithm, SNO-MDP, that explores and optimizes Markov decision processes under unknown safety constraints.
We provide theoretical guarantees on both the satisfaction of the safety constraint and the near-optimality of the cumulative reward.
arXiv Detail & Related papers (2020-08-15T02:20:23Z)
- SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition about the effectiveness of the framework through a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.