Formalising the Foundations of Discrete Reinforcement Learning in
Isabelle/HOL
- URL: http://arxiv.org/abs/2112.05996v1
- Date: Sat, 11 Dec 2021 14:38:36 GMT
- Title: Formalising the Foundations of Discrete Reinforcement Learning in
Isabelle/HOL
- Authors: Mark Chevallier and Jacques Fleuriot
- Abstract summary: We focus on the foundations required for dynamic programming and the use of reinforcement learning agents over finite Markov decision processes with rewards.
We prove the existence of a universally optimal policy when the discounting factor is less than one.
Lastly, we prove that the value iteration and policy iteration algorithms work in finite time, producing an epsilon-optimal and a fully optimal policy respectively.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a formalisation of finite Markov decision processes with rewards
in the Isabelle theorem prover. We focus on the foundations required for
dynamic programming and the use of reinforcement learning agents over such
processes. In particular, we derive the Bellman equation from first principles
(in both scalar and vector form), derive a vector calculation that produces the
expected value of any policy p, and go on to prove the existence of a
universally optimal policy where there is a discounting factor less than one.
Lastly, we prove that the value iteration and the policy iteration algorithms
work in finite time, producing an epsilon-optimal and a fully optimal policy
respectively.
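The abstract's claims correspond to standard facts about finite discounted MDPs. In vector form, the value of a policy p satisfies the Bellman equation $v_p = r_p + \gamma P_p v_p$, and for a discount factor $\gamma < 1$ this yields the vector calculation $v_p = (I - \gamma P_p)^{-1} r_p$ for the expected value of any policy p (presumably the calculation the abstract refers to). As a rough illustration of the epsilon-optimality claim for value iteration, here is a minimal sketch in ordinary Python/NumPy, not the paper's Isabelle/HOL development; the function name, the toy MDP layout, and the stopping threshold are assumptions based on the textbook treatment, under which the greedy policy is epsilon-optimal once successive value estimates are close enough.

import numpy as np

def value_iteration(P, R, gamma, epsilon):
    """P[a][s, s'] = transition probability, R[a][s] = expected reward,
    0 < gamma < 1 the discount factor. Returns a greedy policy that is
    epsilon-optimal, together with the final value estimate."""
    n_actions, n_states = len(P), P[0].shape[0]

    def q_values(V):
        # Q[a, s] = R[a][s] + gamma * sum over s' of P[a][s, s'] * V[s']
        return np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])

    V = np.zeros(n_states)
    # Textbook stopping rule: once successive iterates differ by less than
    # epsilon * (1 - gamma) / (2 * gamma) in sup norm, the greedy policy
    # with respect to the newer iterate is epsilon-optimal.
    threshold = epsilon * (1 - gamma) / (2 * gamma)
    while True:
        V_new = q_values(V).max(axis=0)
        gap = np.max(np.abs(V_new - V))
        V = V_new
        if gap < threshold:
            break
    return q_values(V).argmax(axis=0), V

# Toy 2-state, 2-action MDP (hypothetical numbers, purely for illustration).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.0, 1.0]])]
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
policy, V = value_iteration(P, R, gamma=0.9, epsilon=1e-3)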
Related papers
- Multi-objective Reinforcement Learning with Nonlinear Preferences: Provable Approximation for Maximizing Expected Scalarized Return [1.3162012586770577]
We study multi-objective reinforcement learning with nonlinear preferences over trajectories.
We derive an extended form of Bellman optimality for nonlinear optimization.
We show that there can be a substantial gap between the optimal policy computed by our algorithm and alternative baselines.
arXiv Detail & Related papers (2023-11-05T02:11:07Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Bellman Residual Orthogonalization for Offline Reinforcement Learning [53.17258888552998]
We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class.
arXiv Detail & Related papers (2022-03-24T01:04:17Z)
- Stochastic convex optimization for provably efficient apprenticeship learning [1.0609815608017066]
We consider large-scale Markov decision processes (MDPs) with an unknown cost function.
We employ convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations.
arXiv Detail & Related papers (2021-12-31T19:47:57Z)
- A Subgame Perfect Equilibrium Reinforcement Learning Approach to Time-inconsistent Problems [4.314956204483074]
We establish a subgame perfect equilibrium reinforcement learning framework for time-inconsistent (TIC) problems.
We propose a new class of algorithms, called backward policy iteration (BPI), that solves SPERL and addresses both challenges.
To demonstrate the practical usage of BPI as a training framework, we adapt standard RL simulation methods and derive two BPI-based training algorithms.
arXiv Detail & Related papers (2021-10-27T09:21:35Z)
- Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques.
Its variance-reduced variant, VR-BGPO, reaches the best known complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq [1.154957229836278]
Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward.
This paper develops a formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes.
The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms (a plain policy iteration sketch in the same spirit appears after this list).
arXiv Detail & Related papers (2020-09-23T22:28:17Z)
- Temporal-Logic-Based Reward Shaping for Continuing Learning Tasks [57.17673320237597]
In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation.
This paper presents the first reward shaping framework for average-reward learning.
It proves that, under standard assumptions, the optimal policy under the original reward function can be recovered.
arXiv Detail & Related papers (2020-07-03T05:06:57Z)
- Kernel Taylor-Based Value Function Approximation for Continuous-State Markov Decision Processes [5.894659354028797]
We propose a principled kernel-based policy iteration algorithm to solve continuous-state Markov Decision Processes (MDPs).
We have validated the proposed method through extensive simulations in both simplified and realistic planning scenarios.
arXiv Detail & Related papers (2020-06-03T01:48:43Z)
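Both the paper above and the CertRL entry rest on the fact that, for a finite MDP with a discount factor less than one, policy iteration terminates in finitely many steps with a fully optimal policy. Below is a minimal sketch of that algorithm, again in ordinary Python/NumPy rather than the Isabelle/HOL or Coq developments; the function name and data layout are assumptions matching the earlier value iteration sketch.

import numpy as np

def policy_iteration(P, R, gamma):
    """P[a][s, s'] = transition probability, R[a][s] = expected reward,
    0 < gamma < 1. Returns an optimal deterministic policy and its value."""
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_p) v = r_p exactly.
        P_p = np.array([P[policy[s]][s] for s in range(n_states)])
        r_p = np.array([R[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_p, r_p)
        # Policy improvement: act greedily with respect to V.
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V  # the greedy policy is unchanged, hence optimal
        policy = new_policy

With exact evaluation and deterministic tie-breaking in the greedy step, each iteration either strictly improves the value or leaves the policy unchanged, and since there are only finitely many deterministic policies the loop terminates.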
This list is automatically generated from the titles and abstracts of the papers on this site.