Reducing Sampling Error in Batch Temporal Difference Learning
- URL: http://arxiv.org/abs/2008.06738v1
- Date: Sat, 15 Aug 2020 15:30:06 GMT
- Title: Reducing Sampling Error in Batch Temporal Difference Learning
- Authors: Brahma Pavse, Ishan Durugkar, Josiah Hanna, Peter Stone
- Abstract summary: Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning.
This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data.
- Score: 42.30708351947417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal difference (TD) learning is one of the main foundations of modern
reinforcement learning. This paper studies the use of TD(0), a canonical TD
algorithm, to estimate the value function of a given policy from a batch of
data. In this batch setting, we show that TD(0) may converge to an inaccurate
value function because the update following an action is weighted according to
the number of times that action occurred in the batch -- not the true
probability of the action under the given policy. To address this limitation,
we introduce policy sampling error corrected TD(0) (PSEC-TD(0)).
PSEC-TD(0) first estimates the empirical distribution of actions in each state
in the batch and then uses importance sampling to correct for the mismatch
between the empirical weighting and the correct weighting for updates following
each action. We refine the concept of a certainty-equivalence estimate and
argue that PSEC-TD(0) is a more data efficient estimator than TD(0) for a fixed
batch of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on
three batch value function learning tasks, with a hyperparameter sensitivity
analysis, and show that PSEC-TD(0) produces value function estimates with lower
mean squared error than TD(0).
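As a concrete illustration of the correction described in the abstract, here is a minimal tabular sketch (not the authors' implementation): estimate the maximum-likelihood action distribution pi_hat(a|s) from the batch, then scale each TD(0) update by the importance ratio pi(a|s) / pi_hat(a|s). The function name, data layout, and hyperparameters below are illustrative assumptions.
```python
import numpy as np

def psec_td0(batch, policy, n_states, n_actions,
             alpha=0.1, gamma=0.99, n_sweeps=50):
    """Tabular PSEC-TD(0) sketch: reweight each TD(0) update by
    pi(a|s) / pi_hat(a|s), where pi_hat is the maximum-likelihood
    estimate of the action distribution observed in the batch."""
    # Step 1: empirical action distribution pi_hat(a|s) from the batch
    # of (state, action, reward, next_state, done) tuples.
    counts = np.zeros((n_states, n_actions))
    for s, a, r, s_next, done in batch:
        counts[s, a] += 1
    totals = counts.sum(axis=1, keepdims=True)
    pi_hat = np.divide(counts, totals, out=np.zeros_like(counts),
                       where=totals > 0)

    # Step 2: TD(0) sweeps over the batch, each update scaled by the
    # PSEC importance ratio. Pairs absent from the batch never appear
    # in the loop, so pi_hat[s, a] > 0 whenever it is used.
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s, a, r, s_next, done in batch:
            rho = policy[s, a] / pi_hat[s, a]   # sampling-error correction
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * rho * (target - V[s])
    return V
```
Setting rho = 1 everywhere recovers ordinary batch TD(0); the ratio only reweights updates, so the per-sweep cost is essentially unchanged.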
Related papers
- Discerning Temporal Difference Learning [5.439020425819001]
Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL).
We propose a novel TD algorithm named discerning TD learning (DTD).
arXiv Detail & Related papers (2023-10-12T07:38:10Z)
- On the Statistical Benefits of Temporal Difference Learning [6.408072565019087]
Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions.
We show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates.
We prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states.
arXiv Detail & Related papers (2023-01-30T21:02:25Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (see the sketch after this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates [110.92598350897192]
Q-Learning has proven effective at learning a policy to perform control tasks.
However, estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
arXiv Detail & Related papers (2021-10-28T00:07:19Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Predictor-Corrector (PC) Temporal Difference (TD) Learning (PCTD) [0.0]
Predictor-Corrector Temporal Difference (PCTD) is what I call a Reinforcement Learning (RL) algorithm translated from the theory of discrete-time ODEs.
I propose a new class of TD learning algorithms.
The approximated parameter has a guaranteed order-of-magnitude reduction in the Taylor series error of the solution to the ODE.
arXiv Detail & Related papers (2021-04-15T18:54:16Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Adaptive Temporal Difference Learning with Linear Function Approximation [29.741034258674205]
This paper revisits the temporal difference (TD) learning algorithm for the policy evaluation tasks in reinforcement learning.
We develop a provably convergent adaptive projected variant of the TD(0) learning algorithm with linear function approximation.
We evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard reinforcement learning tasks.
arXiv Detail & Related papers (2020-02-20T02:32:40Z)
- Reanalysis of Variance Reduced Temporal Difference Learning [57.150444843282]
A variance reduced TD (VRTD) algorithm was proposed by Korda and La, which applies the variance reduction technique directly to online TD learning with Markovian samples.
We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate.
arXiv Detail & Related papers (2020-01-07T05:32:43Z)
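As a rough illustration of the ATC idea mentioned in the related-papers entry above, here is a small sketch under assumptions not stated in that summary: confidence is taken to be the max-softmax probability, and the threshold is chosen so that the fraction of source examples above it matches source accuracy. The function name and array layout are hypothetical.
```python
import numpy as np

def atc_predict_accuracy(src_probs, src_labels, tgt_probs):
    """Sketch of Average Thresholded Confidence (ATC): learn a confidence
    threshold on labeled source data, then predict target accuracy as the
    fraction of unlabeled target examples whose confidence exceeds it."""
    src_conf = src_probs.max(axis=1)                 # max-softmax confidence
    src_acc = (src_probs.argmax(axis=1) == src_labels).mean()
    t = np.quantile(src_conf, 1.0 - src_acc)         # P_src(conf > t) ~= src_acc
    return (tgt_probs.max(axis=1) > t).mean()        # predicted target accuracy
```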
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.