CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in
Coq
- URL: http://arxiv.org/abs/2009.11403v2
- Date: Tue, 15 Dec 2020 19:39:30 GMT
- Title: CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in
Coq
- Authors: Koundinya Vajjha, Avraham Shinnar, Vasily Pestun, Barry Trager, Nathan
Fulton
- Abstract summary: Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward.
This paper develops a formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes.
The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms.
- Score: 1.154957229836278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning algorithms solve sequential decision-making problems
in probabilistic environments by optimizing for long-term reward. The desire to
use reinforcement learning in safety-critical settings inspires a recent line
of work on formally constrained reinforcement learning; however, these methods
place the implementation of the learning algorithm in their Trusted Computing
Base. The crucial correctness property of these implementations is a guarantee
that the learning algorithm converges to an optimal policy. This paper begins
the work of closing this gap by developing a Coq formalization of two canonical
reinforcement learning algorithms: value and policy iteration for finite state
Markov decision processes. The central results are a formalization of Bellman's
optimality principle and its proof, which uses a contraction property of the
Bellman optimality operator to establish that the sequence of iterates converges
in the infinite horizon limit. The CertRL development exemplifies how the Giry monad
and mechanized metric coinduction streamline optimality proofs for
reinforcement learning algorithms. The CertRL library provides a general
framework for proving properties about Markov decision processes and
reinforcement learning algorithms, paving the way for further work on
formalization of reinforcement learning algorithms.
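For context, the contraction argument referenced above is the standard one: the Bellman optimality operator $(T V)(s) = \max_a [\, r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \,]$ is a $\gamma$-contraction in the supremum norm, so by the Banach fixed-point theorem its iterates converge to the unique optimal value function. The sketch below is a minimal Python illustration of value iteration built on that fact; it is not the Coq code from the CertRL library, and the array layout (P[s, a, s'] for transitions, R[s, a] for rewards) is an assumption made for this illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Illustrative value iteration for a finite MDP (not the CertRL Coq code).

    P : (S, A, S) array of transition probabilities P[s, a, s'].
    R : (S, A) array of expected rewards R[s, a].
    The Bellman optimality operator applied inside the loop is a
    gamma-contraction in the sup norm, so the iterates V_k converge to the
    unique fixed point V* -- the fact the CertRL proofs mechanize.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # (T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s') ]
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_next = Q.max(axis=1)
        if np.max(np.abs(V_next - V)) < tol:
            V = V_next
            break
        V = V_next
    greedy_policy = Q.argmax(axis=1)  # greedy with respect to the last iterate
    return V, greedy_policy
```

Policy iteration admits a similar treatment for finite MDPs: it alternates policy evaluation with a greedy improvement step over the same transition and reward data, and its convergence is likewise classical.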
Related papers
- Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time [0.6554326244334868]
We present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint.
We show that the learned policy is safe with high confidence.
We also demonstrate that efficient exploration can be achieved by defining a subset of the state space called the proxy set.
arXiv Detail & Related papers (2024-03-23T20:22:30Z)
- Regularized Q-Learning with Linear Function Approximation [2.765106384328772]
We consider a bi-level optimization formulation of regularized Q-learning with linear function approximation.
We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise.
arXiv Detail & Related papers (2024-01-26T20:45:40Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Stochastic convex optimization for provably efficient apprenticeship learning [1.0609815608017066]
We consider large-scale Markov decision processes (MDPs) with an unknown cost function.
We employ convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations.
arXiv Detail & Related papers (2021-12-31T19:47:57Z)
- Formalising the Foundations of Discrete Reinforcement Learning in Isabelle/HOL [0.0]
We focus on the foundations required for dynamic programming and the use of reinforcement learning agents over such processes.
We prove the existence of a universally optimal policy whenever the discount factor is less than one.
Lastly, we prove that the value iteration and policy iteration algorithms terminate in finite time, producing an epsilon-optimal and a fully optimal policy, respectively (the standard bound behind the epsilon-optimality claim is sketched after this entry).
arXiv Detail & Related papers (2021-12-11T14:38:36Z)
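For context, the epsilon-optimality claim is usually obtained from a textbook stopping rule for value iteration; the bound below is the standard one and not necessarily the exact statement mechanized in Isabelle/HOL. If two successive iterates are close enough,
\[
  \|V_{k+1} - V_k\|_\infty \;\le\; \frac{\epsilon\,(1-\gamma)}{2\gamma},
\]
then the policy $\pi$ that is greedy with respect to $V_{k+1}$ satisfies $\|V^{\pi} - V^{*}\|_\infty \le \epsilon$, i.e. it is $\epsilon$-optimal, and value iteration reaches this threshold in finitely many steps because the Bellman operator is a $\gamma$-contraction.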
- A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes whose complexity is independent of the number of states.
We give an efficient algorithm capable of boosting the accuracy of weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z)
- Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation [92.3161051419884]
We study reinforcement learning with linear function approximation.
Existing algorithms only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees.
We propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability.
arXiv Detail & Related papers (2021-06-22T08:48:56Z)
- Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Constrained Combinatorial Optimization with Reinforcement Learning [0.30938904602244344]
This paper presents a framework for tackling constrained combinatorial optimization problems using deep Reinforcement Learning (RL).
We extend the Neural Combinatorial Optimization (NCO) theory in order to deal with constraints in its formulation.
In that context, the solution is iteratively constructed based on interactions with the environment.
arXiv Detail & Related papers (2020-06-22T03:13:07Z)
- A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms [67.67377846416106]
We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes.
We show that value-based methods such as TD($\lambda$) and $Q$-Learning have update rules which are contractive in the space of distributions of functions.
arXiv Detail & Related papers (2020-03-27T05:13:29Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.