Policy Evaluation and Temporal-Difference Learning in Continuous Time
and Space: A Martingale Approach
- URL: http://arxiv.org/abs/2108.06655v1
- Date: Sun, 15 Aug 2021 03:37:17 GMT
- Title: Policy Evaluation and Temporal-Difference Learning in Continuous Time
and Space: A Martingale Approach
- Authors: Yanwei Jia and Xun Yu Zhou
- Abstract summary: We show that policy evaluation is equivalent to maintaining the martingale condition of a process.
We present two methods to use the martingale characterization for designing PE algorithms.
- Score: 1.776746672434207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a unified framework to study policy evaluation (PE) and the
associated temporal difference (TD) methods for reinforcement learning in
continuous time and space. We show that PE is equivalent to maintaining the
martingale condition of a process. From this perspective, we find that the
mean-square TD error approximates the quadratic variation of the martingale
and thus is not a suitable objective for PE. We present two methods to use the
martingale characterization for designing PE algorithms. The first one
minimizes a "martingale loss function", whose solution is proved to be the best
approximation of the true value function in the mean-square sense. This method
interprets the classical gradient Monte-Carlo algorithm. The second method is
based on a system of equations called the "martingale orthogonality conditions"
with "test functions". Solving these equations in different ways recovers
various classical TD algorithms, such as TD($\lambda$), LSTD, and GTD.
Different choices of test functions determine in what sense the resulting
solutions approximate the true value function. Moreover, we prove that any
convergent time-discretized algorithm converges to its continuous-time
counterpart as the mesh size goes to zero. We demonstrate the theoretical
results and corresponding algorithms with numerical experiments and
applications.
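To make the abstract's objects concrete, here is a minimal sketch in an undiscounted, finite-horizon setting with a terminal payoff h; the specific setting and notation are assumptions made for illustration, not quotations from the paper. A candidate value function J solves PE exactly when the first process below is a martingale; the first method minimizes the martingale loss, and the second imposes orthogonality of the martingale against test functions:

```latex
% Sketch only: undiscounted finite-horizon setting with terminal payoff h;
% notation is illustrative, not quoted from the paper.
M_t = J(t, X_t) + \int_0^t r_s \, ds, \qquad J(T, \cdot) = h(\cdot)
\qquad \text{(PE holds iff } M \text{ is a martingale)}

\mathrm{ML}(\theta) = \frac{1}{2}\, \mathbb{E}\!\left[ \int_0^T
  \Big( h(X_T) + \int_t^T r_s \, ds - J^{\theta}(t, X_t) \Big)^{2} dt \right]
\qquad \text{(martingale loss)}

\mathbb{E}\!\left[ \int_0^T \xi_t \, dM_t^{\theta} \right] = 0
\quad \text{for every test function } \xi
\qquad \text{(martingale orthogonality conditions)}
```

Different test functions (for instance $\xi_t = \frac{\partial J^\theta}{\partial \theta}(t, X_t)$, or eligibility-trace-weighted variants) are what recover TD($\lambda$)-, LSTD-, and GTD-type algorithms after time discretization. When the martingale loss is discretized on a grid of mesh $\Delta t$, gradient descent on it reduces to a gradient Monte-Carlo update on sampled trajectories, which is the interpretation mentioned in the abstract. A hypothetical Python sketch of one such step (the value approximation J, its gradient grad_J, and all names are illustrative assumptions):

```python
import numpy as np

def martingale_loss_step(theta, times, states, rewards, terminal_value,
                         J, grad_J, dt, lr=1e-2):
    """One gradient step on the time-discretized martingale loss for a single
    sampled trajectory observed on a grid of mesh dt."""
    # Realized reward-to-go from each grid point, plus the terminal payoff h(X_T).
    togo = terminal_value + np.flip(np.cumsum(np.flip(np.asarray(rewards)))) * dt
    grad = np.zeros_like(theta)
    for t, x, g in zip(times, states, togo):
        residual = g - J(theta, t, x)                 # martingale-loss residual
        grad += residual * grad_J(theta, t, x) * dt   # accumulates -dML/dtheta
    return theta + lr * grad                          # descent step on ML
```

This sketches only the first (martingale-loss) method; the second family instead solves the orthogonality conditions above, for example by stochastic approximation, and different ways of doing so correspond to the classical TD algorithms named in the abstract.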
Related papers
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for such statistical inference.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed.
arXiv Detail & Related papers (2024-06-12T05:49:53Z) - Online Learning with Adversaries: A Differential-Inclusion Analysis [52.43460995467893]
We introduce an observation-matrix-based framework for fully asynchronous online Federated Learning with adversaries.
Our main result is that the proposed algorithm almost surely converges to the desired mean $\mu$.
We derive this convergence using a novel differential-inclusion-based two-timescale analysis.
arXiv Detail & Related papers (2023-04-04T04:32:29Z) - Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both
Worlds in Stochastic and Deterministic Environments [48.96971760679639]
We study variance-dependent regret bounds for Markov decision processes (MDPs).
We propose two new environment norms to characterize the fine-grained variance properties of the environment.
For model-based methods, we design a variant of the MVP algorithm.
In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs.
arXiv Detail & Related papers (2023-01-31T06:54:06Z) - Posterior and Computational Uncertainty in Gaussian Processes [52.26904059556759]
Gaussian processes scale prohibitively with the size of the dataset.
Many approximation methods have been developed, which inevitably introduce approximation error.
This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior.
We develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended.
arXiv Detail & Related papers (2022-05-30T22:16:25Z) - Temporal Difference Learning with Continuous Time and State in the
Stochastic Setting [0.0]
We consider the problem of continuous-time policy evaluation.
This consists in learning, from observations, the value function associated with an uncontrolled continuous-time dynamic and a reward function.
arXiv Detail & Related papers (2022-02-16T10:10:53Z) - Predictor-Corrector(PC) Temporal Difference(TD) Learning (PCTD) [0.0]
Predictor-Corrector Temporal Difference (PCTD) is what I call the Reinforcement Learning (RL) algorithm translated from the theory of discrete-time ODEs.
I propose a new class of TD learning algorithms.
The approximated parameter has a guaranteed order-of-magnitude reduction in the Taylor-series error of the solution to the ODE.
arXiv Detail & Related papers (2021-04-15T18:54:16Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Optimal oracle inequalities for solving projected fixed-point equations [53.31620399640334]
We study methods that use a collection of random observations to compute approximate solutions by searching over a known low-dimensional subspace of the Hilbert space.
We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation.
arXiv Detail & Related papers (2020-12-09T20:19:32Z) - Simple and optimal methods for stochastic variational inequalities, II:
Markovian noise and policy evaluation in reinforcement learning [9.359939442911127]
This paper focuses on stochastic variational inequalities (VIs) under Markovian noise.
A prominent application of our algorithmic developments is the policy evaluation problem in reinforcement learning.
arXiv Detail & Related papers (2020-11-15T04:05:22Z) - Proximal Gradient Temporal Difference Learning: Stable Reinforcement
Learning with Polynomial Sample Complexity [40.73281056650241]
We introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true gradient temporal difference learning algorithms.
We show how gradient TD reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function.
arXiv Detail & Related papers (2020-06-06T21:04:21Z)
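The saddle-point derivation mentioned in the last entry above is usually presented, in the discounted MDP setting with linear function approximation (an assumption here, following the standard gradient-TD literature rather than quoting that paper), via the conjugate-dual form of the MSPBE: $\min_\theta \max_y \langle b - A\theta, y\rangle - \tfrac{1}{2}\|y\|_M^2$ with $A = \mathbb{E}[\phi(\phi - \gamma\phi')^\top]$, $b = \mathbb{E}[r\phi]$, and $M = \mathbb{E}[\phi\phi^\top]$. Alternating stochastic gradient steps on this objective yield GTD2-style primal-dual updates; a hypothetical sketch with illustrative names and step sizes:

```python
import numpy as np

def gtd2_step(theta, y, phi, reward, phi_next, gamma=0.99, alpha=1e-3, beta=1e-2):
    """One primal-dual update from the saddle-point view of the MSPBE.
    theta: value-function weights (primal); y: dual variable estimating
    M^{-1}(b - A theta); phi, phi_next: feature vectors of current/next state."""
    delta = reward + gamma * phi_next @ theta - phi @ theta       # TD error
    y = y + beta * (delta - phi @ y) * phi                        # dual (ascent) step
    theta = theta + alpha * (phi @ y) * (phi - gamma * phi_next)  # primal (descent) step
    return theta, y
```

The proximal-gradient algorithms in that paper build further on this saddle-point view with proximal operators to obtain their stability and sample-complexity guarantees; the plain-gradient version above is only meant to make the derivation concrete.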