Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently
Distilled RL Policies with Many-sided Guarantees
- URL: http://arxiv.org/abs/2303.12558v2
- Date: Fri, 21 Apr 2023 13:54:19 GMT
- Title: Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently
Distilled RL Policies with Many-sided Guarantees
- Authors: Florent Delgrange, Ann Nowé, Guillermo A. Pérez
- Abstract summary: Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling verifiable controllers from any RL policy.
We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy.
Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although deep reinforcement learning (DRL) has many success stories, the
large-scale deployment of policies learned through these advanced techniques in
safety-critical scenarios is hindered by their lack of formal guarantees.
Variational Markov Decision Processes (VAE-MDPs) are discrete latent space
models that provide a reliable framework for distilling formally verifiable
controllers from any RL policy. While the related guarantees address relevant
practical aspects such as the satisfaction of performance and safety
properties, the VAE approach suffers from several learning flaws (posterior
collapse, slow learning speed, poor dynamics estimates), primarily due to the
absence of abstraction and representation guarantees to support latent
optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent
space model that fixes those issues by minimizing a penalized form of the
optimal transport between the behaviors of the agent executing the original
policy and the distilled policy, for which the formal guarantees apply. Our
approach yields bisimulation guarantees while learning the distilled policy,
allowing concrete optimization of the abstraction and representation model
quality. Our experiments show that, besides distilling policies up to 10 times
faster, the latent model quality is indeed better in general. Moreover, we
present experiments from a simple time-to-failure verification algorithm on the
latent space. The fact that our approach enables such simple verification
techniques highlights its applicability.
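As a rough, hedged illustration of why a discrete latent model makes such verification simple, the sketch below computes step-bounded failure probabilities (a basic form of time-to-failure analysis) on a hypothetical finite latent model under the distilled policy. This is not the authors' implementation; the transition matrix `P` and the `failure` labelling are placeholders.
```python
# Minimal sketch (NOT the paper's algorithm): step-bounded failure probability
# on a discrete latent model distilled with a WAE-MDP. Assumption: under the
# distilled policy, the latent model reduces to a finite Markov chain with a
# hypothetical row-stochastic transition matrix P; `failure` flags unsafe states.
import numpy as np

def failure_prob_within(P: np.ndarray, failure: np.ndarray, horizon: int) -> np.ndarray:
    """Probability, per latent state, of reaching a failure state within
    `horizon` steps (bounded reachability via value iteration)."""
    prob = failure.astype(float)          # P(reach failure within 0 steps)
    for _ in range(horizon):
        # Bellman backup: already failed, or fail from a successor state.
        prob = np.where(failure, 1.0, P @ prob)
    return prob

# Toy 3-state latent chain where state 2 is the failure state.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
failure = np.array([False, False, True])
print(failure_prob_within(P, failure, horizon=20))
```
If failure is eventually reached with probability one, summing 1 - failure_prob_within(P, failure, k) over horizons k approximates the expected time to failure, which is the kind of quantity such a latent-space verification step targets.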
Related papers
- Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters.
We learn and analyse interval MDPs (IMDPs) for a set of unknown sample environments induced by parameters.
We show that our approach produces tight bounds on a policy's performance with high confidence.
arXiv Detail & Related papers (2024-08-06T10:48:15Z)
- Safe Policy Improvement in Constrained Markov Decision Processes [10.518340300810504]
We present a solution to the synthesis problem by solving its two main challenges: reward-shaping from a set of formal requirements and safe policy update.
For the former, we propose an automatic reward-shaping procedure, defining a scalar reward signal compliant with the task specification.
For the latter, we introduce an algorithm ensuring that the policy is improved in a safe fashion with high-confidence guarantees.
arXiv Detail & Related papers (2022-10-20T13:29:32Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size (see the sketch after this list).
We demonstrate the effectiveness of our approach at minimizing constraint violations in safe reinforcement learning policy tasks.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
- Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques.
We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- Joint Differentiable Optimization and Verification for Certified Reinforcement Learning [91.93635157885055]
In model-based reinforcement learning for safety-critical control systems, it is important to formally certify system properties.
We propose a framework that jointly conducts reinforcement learning and formal verification.
arXiv Detail & Related papers (2022-01-28T16:53:56Z)
- Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes (Technical Report) [0.0]
We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL).
We derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it.
We show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees.
arXiv Detail & Related papers (2021-12-17T17:57:32Z)
- Pessimistic Model Selection for Offline Deep Reinforcement Learning [56.282483586473816]
Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision-making problems in many applications.
One main barrier is over-fitting, which leads to poor generalizability of the policies learned by DRL.
We propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee.
arXiv Detail & Related papers (2021-11-29T06:29:49Z)
- Lyapunov-based uncertainty-aware safe reinforcement learning [0.0]
Reinforcement learning (RL) has shown promising performance in learning optimal policies for a variety of sequential decision-making tasks.
In many real-world RL problems, besides optimizing the main objectives, the agent is expected to satisfy a certain level of safety.
We propose a Lyapunov-based uncertainty-aware safe RL model to address these limitations.
arXiv Detail & Related papers (2021-07-29T13:08:15Z)
- An Abstraction-based Method to Verify Multi-Agent Deep Reinforcement-Learning Behaviours [8.95294551927446]
Multi-agent reinforcement learning (RL) often struggles to ensure the safe behaviours of the learning agents.
We present a methodology that combines formal verification with (deep) RL algorithms to guarantee the satisfaction of formally-specified safety constraints.
arXiv Detail & Related papers (2021-02-02T11:12:30Z)
- Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty [9.246374019271935]
We propose to merge the theory of constrained Markov decision processes (CMDPs) with the theory of robust Markov decision processes (RMDPs).
This formulation allows us to design RL algorithms that are robust in performance, and provides constraint satisfaction guarantees.
We first propose the general problem formulation under the concept of RCMDP, and then propose a Lagrangian formulation of the optimization problem, leading to a robust-constrained policy gradient RL algorithm.
arXiv Detail & Related papers (2020-10-10T01:53:37Z)
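To make the log-barrier idea referenced in the LBSGD entry above slightly more concrete, here is a minimal, hedged sketch of a generic log-barrier method on a toy constrained problem. It is not the LBSGD algorithm itself; the objective, constraint, barrier weights, and step-size rule below are illustrative placeholders.
```python
# Minimal sketch of a generic log-barrier method (NOT LBSGD itself):
# minimize f(x) subject to g(x) <= 0 by descending the barrier objective
# f(x) - eta * sum(log(-g(x))) while keeping the iterate strictly feasible.
# The toy problem, barrier weights `eta`, and step-size cap are placeholders.
import numpy as np

def barrier_objective(x, f, g, eta):
    gx = g(x)
    if np.any(gx >= 0):                        # outside the strictly feasible region
        return np.inf
    return f(x) - eta * np.sum(np.log(-gx))

def numeric_grad(h, x, eps=1e-6):
    grad = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        grad[i] = (h(x + d) - h(x - d)) / (2.0 * eps)
    return grad

# Toy problem: minimize (x0 - 2)^2 + (x1 - 2)^2 subject to x0 + x1 <= 2.
f = lambda x: np.sum((x - 2.0) ** 2)
g = lambda x: np.array([x[0] + x[1] - 2.0])

x = np.array([0.0, 0.0])                       # strictly feasible start
for eta in [1.0, 0.3, 0.1, 0.03, 0.01]:        # gradually weaken the barrier
    for _ in range(500):
        grad = numeric_grad(lambda z: barrier_objective(z, f, g, eta), x)
        slack = -g(x)[0]                       # distance to the constraint boundary
        # Crude stand-in for a carefully chosen step size: never move far
        # enough in one step to cross the constraint boundary.
        alpha = min(1e-2, 0.25 * slack / (np.abs(grad).sum() + 1e-12))
        x = x - alpha * grad
print(x)                                       # close to [1, 1], the constrained optimum
```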
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.