Distillation of RL Policies with Formal Guarantees via Variational
Abstraction of Markov Decision Processes (Technical Report)
- URL: http://arxiv.org/abs/2112.09655v1
- Date: Fri, 17 Dec 2021 17:57:32 GMT
- Title: Distillation of RL Policies with Formal Guarantees via Variational
Abstraction of Markov Decision Processes (Technical Report)
- Authors: Florent Delgrange, Ann Nowé, Guillermo A. Pérez
- Abstract summary: We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL).
We derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it.
We show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the challenge of policy simplification and verification in the
context of policies learned through reinforcement learning (RL) in continuous
environments. In well-behaved settings, RL algorithms have convergence
guarantees in the limit. While these guarantees are valuable, they are
insufficient for safety-critical applications. Furthermore, they are lost when
applying advanced techniques such as deep-RL. To recover guarantees when
applying advanced RL algorithms to more complex environments with (i)
reachability, (ii) safety-constrained reachability, or (iii) discounted-reward
objectives, we build upon the DeepMDP framework introduced by Gelada et al. to
derive new bisimulation bounds between the unknown environment and a learned
discrete latent model of it. Our bisimulation bounds enable the application of
formal methods for Markov decision processes. Finally, we show how one can use
a policy obtained via state-of-the-art RL to efficiently train a variational
autoencoder that yields a discrete latent model with provably approximately
correct bisimulation guarantees. Additionally, we obtain a distilled version of
the policy for the latent model.
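For intuition, the DeepMDP-style analysis this work builds on relates the value functions of the original and latent models through local losses. The following is a representative bound in the spirit of Gelada et al., not the paper's exact theorem (the paper measures the losses under the distribution induced by the RL policy and obtains probably-approximately-correct versions of such bounds for a discrete latent space):

Let \phi map environment states to latent states, and define the local losses
  L_R = \sup_{s,a} | R(s,a) - \bar{R}(\phi(s),a) |,
  L_P = \sup_{s,a} W_d\big( \phi P(\cdot \mid s,a), \bar{P}(\cdot \mid \phi(s),a) \big),
where W_d denotes the Wasserstein distance and \phi P is the pushforward of the transition kernel through \phi. If the latent value function \bar{V} is K_{\bar{V}}-Lipschitz, then for all states s,
  | V^{\pi}(s) - \bar{V}^{\bar{\pi}}(\phi(s)) | \le \frac{L_R + \gamma K_{\bar{V}} L_P}{1 - \gamma}.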
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z)
- Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees [0.0]
Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling verifiable controllers from any RL policy.
We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy.
Our experiments show that, besides distilling policies up to 10 times faster, the resulting latent models are generally of better quality.
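As a rough illustration of the penalized optimal-transport idea, the following Python sketch shows a generic WAE-style objective: a reconstruction cost plus a penalized divergence matching the encoded distribution to a latent prior. This is not the WAE-MDP loss itself (which also accounts for transition and reward behavior); phi, psi, prior_sample, and latent_divergence are placeholder assumptions.

import torch

def wae_style_loss(x, phi, psi, prior_sample, latent_divergence, lam=10.0):
    # Encode a batch of observed states and reconstruct them.
    z = phi(x)
    x_hat = psi(z)
    # Ground cost d(x, psi(phi(x))): here a simple squared error.
    recon = torch.mean((x - x_hat) ** 2)
    # Penalized divergence between the encoded batch and prior samples,
    # e.g. an MMD estimate; this is the penalized optimal-transport term.
    penalty = latent_divergence(z, prior_sample(z.shape[0]))
    return recon + lam * penalty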
arXiv Detail & Related papers (2023-03-22T13:41:42Z)
- Safe Reinforcement Learning via Confidence-Based Filters [78.39359694273575]
We develop a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard reinforcement learning techniques.
We provide formal safety guarantees, and empirically demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-07-04T11:43:23Z)
- Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z)
- Lyapunov-based uncertainty-aware safe reinforcement learning [0.0]
Reinforcement learning (RL) has shown promising performance in learning optimal policies for a variety of sequential decision-making tasks.
In many real-world RL problems, besides optimizing the main objectives, the agent is expected to satisfy a certain level of safety.
We propose a Lyapunov-based uncertainty-aware safe RL model to address these limitations.
arXiv Detail & Related papers (2021-07-29T13:08:15Z)
- Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between a limit-deterministic generalized Büchi automaton (LDGBA) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
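The product construction itself is standard: product states pair an MDP state with an automaton state, and the automaton advances on the label of each visited state. A minimal Python sketch of that generic idea (the EP-MDP additionally embeds a record of the accepting sets still to be visited, omitted here; all names are illustrative):

def product_step(mdp_step, automaton_delta, label_of, s, q, a):
    # Take action a in MDP state s, then advance the automaton state q
    # on the label of the successor state.
    s_next = mdp_step(s, a)
    q_next = automaton_delta(q, label_of(s_next))
    return s_next, q_next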
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
- Constrained Markov Decision Processes via Backward Value Functions [43.649330976089004]
We model the problem of learning with constraints as a Constrained Markov Decision Process.
A key contribution of our approach is to translate cumulative cost constraints into state-based constraints.
We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training.
arXiv Detail & Related papers (2020-08-26T20:56:16Z)
- Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs [13.922754427601491]
We characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy.
Our main finding is that, compared to the best known bounds for the unconstrained regime, the sample complexity of constrained RL algorithms is increased by a factor that is logarithmic in the number of constraints.
arXiv Detail & Related papers (2020-08-01T18:17:08Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
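For context, the conservative term at the heart of CQL pushes down Q-values on out-of-distribution actions (via a log-sum-exp over all actions) while pushing them up on dataset actions, which is what yields the lower bound. A minimal sketch of a CQL(H)-style regularizer for discrete actions (shapes and the weight alpha are illustrative assumptions; this term is added to the usual Bellman error):

import torch

def cql_regularizer(q_values, dataset_actions, alpha=1.0):
    # q_values: (batch, num_actions); dataset_actions: (batch,) int64.
    logsumexp_q = torch.logsumexp(q_values, dim=1)  # soft maximum over all actions
    data_q = q_values.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)  # Q(s, a_data)
    return alpha * (logsumexp_q - data_q).mean()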
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained proximal policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum, resulting in optimal yet physically feasible robotic control behavior without the need for precise reward-function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.