Formalising the Foundations of Discrete Reinforcement Learning in
Isabelle/HOL
- URL: http://arxiv.org/abs/2112.05996v1
- Date: Sat, 11 Dec 2021 14:38:36 GMT
- Title: Formalising the Foundations of Discrete Reinforcement Learning in
Isabelle/HOL
- Authors: Mark Chevallier and Jacques Fleuriot
- Abstract summary: We focus on the foundations required for dynamic programming and the use of reinforcement learning agents over finite Markov decision processes with rewards.
We prove the existence of a universally optimal policy when the discounting factor is less than one.
Lastly, we prove that the value iteration and policy iteration algorithms work in finite time, producing an epsilon-optimal and a fully optimal policy respectively.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a formalisation of finite Markov decision processes with rewards
in the Isabelle theorem prover. We focus on the foundations required for
dynamic programming and the use of reinforcement learning agents over such
processes. In particular, we derive the Bellman equation from first principles
(in both scalar and vector form), derive a vector calculation that produces the
expected value of any policy p, and go on to prove the existence of a
universally optimal policy where there is a discounting factor less than one.
Lastly, we prove that the value iteration and the policy iteration algorithms
work in finite time, producing an epsilon-optimal and a fully optimal policy
respectively.
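The abstract's claims correspond to standard facts about finite discounted MDPs. In vector form, the value of a policy p satisfies the Bellman equation $v_p = r_p + \gamma P_p v_p$, and for a discount factor $\gamma < 1$ this yields the vector calculation $v_p = (I - \gamma P_p)^{-1} r_p$ for the expected value of any policy p (presumably the calculation the abstract refers to). As a rough illustration of the epsilon-optimality claim for value iteration, here is a minimal sketch in ordinary Python/NumPy, not the paper's Isabelle/HOL development; the function name, the toy MDP layout, and the stopping threshold are assumptions based on the textbook treatment, under which the greedy policy is epsilon-optimal once successive value estimates are close enough.

import numpy as np

def value_iteration(P, R, gamma, epsilon):
    """P[a][s, s'] = transition probability, R[a][s] = expected reward,
    0 < gamma < 1 the discount factor. Returns a greedy policy that is
    epsilon-optimal, together with the final value estimate."""
    n_actions, n_states = len(P), P[0].shape[0]

    def q_values(V):
        # Q[a, s] = R[a][s] + gamma * sum over s' of P[a][s, s'] * V[s']
        return np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])

    V = np.zeros(n_states)
    # Textbook stopping rule: once successive iterates differ by less than
    # epsilon * (1 - gamma) / (2 * gamma) in sup norm, the greedy policy
    # with respect to the newer iterate is epsilon-optimal.
    threshold = epsilon * (1 - gamma) / (2 * gamma)
    while True:
        V_new = q_values(V).max(axis=0)
        gap = np.max(np.abs(V_new - V))
        V = V_new
        if gap < threshold:
            break
    return q_values(V).argmax(axis=0), V

# Toy 2-state, 2-action MDP (hypothetical numbers, purely for illustration).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.0, 1.0]])]
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
policy, V = value_iteration(P, R, gamma=0.9, epsilon=1e-3)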
Related papers
- Multi-objective Reinforcement Learning with Nonlinear Preferences: Provable Approximation for Maximizing Expected Scalarized Return [1.3162012586770577]
We study multi-objective reinforcement learning with nonlinear preferences over trajectories.
We derive an extended form of Bellman optimality for nonlinear optimization.
We show that there can be a substantial gap between the optimal policy computed by our algorithm and alternative baselines.
arXiv Detail & Related papers (2023-11-05T02:11:07Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Bellman Residual Orthogonalization for Offline Reinforcement Learning [53.17258888552998]
We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class.
arXiv Detail & Related papers (2022-03-24T01:04:17Z)
- Stochastic convex optimization for provably efficient apprenticeship learning [1.0609815608017066]
We consider large-scale Markov decision processes (MDPs) with an unknown cost function.
We employ convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations.
arXiv Detail & Related papers (2021-12-31T19:47:57Z)
- A Subgame Perfect Equilibrium Reinforcement Learning Approach to Time-inconsistent Problems [4.314956204483074]
We establish a subgame perfect equilibrium reinforcement learning framework for time-inconsistent (TIC) problems.
We propose a new class of algorithms, called backward policy iteration (BPI), that solves SPERL and addresses both challenges.
To demonstrate the practical usage of BPI as a training framework, we adapt standard RL simulation methods and derive two BPI-based training algorithms.
arXiv Detail & Related papers (2021-10-27T09:21:35Z)
- Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques.
Its variance-reduced variant, VR-BGPO, reaches the best known complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq [1.154957229836278]
Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward.
This paper develops a formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes.
The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms (a plain policy iteration sketch in the same spirit appears after this list).
arXiv Detail & Related papers (2020-09-23T22:28:17Z)
- Temporal-Logic-Based Reward Shaping for Continuing Learning Tasks [57.17673320237597]
In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation.
This paper presents the first reward shaping framework for average-reward learning.
It proves that, under standard assumptions, the optimal policy under the original reward function can be recovered.
arXiv Detail & Related papers (2020-07-03T05:06:57Z)
- Kernel Taylor-Based Value Function Approximation for Continuous-State Markov Decision Processes [5.894659354028797]
We propose a principled kernel-based policy iteration algorithm to solve continuous-state Markov Decision Processes (MDPs).
We have validated the proposed method through extensive simulations in both simplified and realistic planning scenarios.
arXiv Detail & Related papers (2020-06-03T01:48:43Z)
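Both the paper above and the CertRL entry rest on the fact that, for a finite MDP with a discount factor less than one, policy iteration terminates in finitely many steps with a fully optimal policy. Below is a minimal sketch of that algorithm, again in ordinary Python/NumPy rather than the Isabelle/HOL or Coq developments; the function name and data layout are assumptions matching the earlier value iteration sketch.

import numpy as np

def policy_iteration(P, R, gamma):
    """P[a][s, s'] = transition probability, R[a][s] = expected reward,
    0 < gamma < 1. Returns an optimal deterministic policy and its value."""
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_p) v = r_p exactly.
        P_p = np.array([P[policy[s]][s] for s in range(n_states)])
        r_p = np.array([R[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_p, r_p)
        # Policy improvement: act greedily with respect to V.
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V  # the greedy policy is unchanged, hence optimal
        policy = new_policy

With exact evaluation and deterministic tie-breaking in the greedy step, each iteration either strictly improves the value or leaves the policy unchanged, and since there are only finitely many deterministic policies the loop terminates.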
This list is automatically generated from the titles and abstracts of the papers on this site.