Reducing Sampling Error in Batch Temporal Difference Learning
- URL: http://arxiv.org/abs/2008.06738v1
- Date: Sat, 15 Aug 2020 15:30:06 GMT
- Title: Reducing Sampling Error in Batch Temporal Difference Learning
- Authors: Brahma Pavse, Ishan Durugkar, Josiah Hanna, Peter Stone
- Abstract summary: Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning.
This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data.
- Score: 42.30708351947417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal difference (TD) learning is one of the main foundations of modern
reinforcement learning. This paper studies the use of TD(0), a canonical TD
algorithm, to estimate the value function of a given policy from a batch of
data. In this batch setting, we show that TD(0) may converge to an inaccurate
value function because the update following an action is weighted according to
the number of times that action occurred in the batch -- not the true
probability of the action under the given policy. To address this limitation,
we introduce policy sampling error corrected TD(0) (PSEC-TD(0)).
PSEC-TD(0) first estimates the empirical distribution of actions in each state
in the batch and then uses importance sampling to correct for the mismatch
between the empirical weighting and the correct weighting for updates following
each action. We refine the concept of a certainty-equivalence estimate and
argue that PSEC-TD(0) is a more data efficient estimator than TD(0) for a fixed
batch of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on
three batch value function learning tasks, with a hyperparameter sensitivity
analysis, and show that PSEC-TD(0) produces value function estimates with lower
mean squared error than TD(0).
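As a concrete illustration of the correction described in the abstract, here is a minimal tabular sketch (not the authors' implementation): estimate the maximum-likelihood action distribution pi_hat(a|s) from the batch, then scale each TD(0) update by the importance ratio pi(a|s) / pi_hat(a|s). The function name, data layout, and hyperparameters below are illustrative assumptions.
```python
import numpy as np

def psec_td0(batch, policy, n_states, n_actions,
             alpha=0.1, gamma=0.99, n_sweeps=50):
    """Tabular PSEC-TD(0) sketch: reweight each TD(0) update by
    pi(a|s) / pi_hat(a|s), where pi_hat is the maximum-likelihood
    estimate of the action distribution observed in the batch."""
    # Step 1: empirical action distribution pi_hat(a|s) from the batch
    # of (state, action, reward, next_state, done) tuples.
    counts = np.zeros((n_states, n_actions))
    for s, a, r, s_next, done in batch:
        counts[s, a] += 1
    totals = counts.sum(axis=1, keepdims=True)
    pi_hat = np.divide(counts, totals, out=np.zeros_like(counts),
                       where=totals > 0)

    # Step 2: TD(0) sweeps over the batch, each update scaled by the
    # PSEC importance ratio. Pairs absent from the batch never appear
    # in the loop, so pi_hat[s, a] > 0 whenever it is used.
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s, a, r, s_next, done in batch:
            rho = policy[s, a] / pi_hat[s, a]   # sampling-error correction
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * rho * (target - V[s])
    return V
```
Setting rho = 1 everywhere recovers ordinary batch TD(0); the ratio only reweights updates, so the per-sweep cost is essentially unchanged.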
Related papers
- Discerning Temporal Difference Learning [5.439020425819001]
Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL).
We propose a novel TD algorithm named discerning TD learning (DTD).
arXiv Detail & Related papers (2023-10-12T07:38:10Z)
- On the Statistical Benefits of Temporal Difference Learning [6.408072565019087]
Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions.
We show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates.
We prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states.
arXiv Detail & Related papers (2023-01-30T21:02:25Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (see the sketch after this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates [110.92598350897192]
Q-Learning has proven effective at learning a policy to perform control tasks.
However, estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
arXiv Detail & Related papers (2021-10-28T00:07:19Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Predictor-Corrector (PC) Temporal Difference (TD) Learning (PCTD) [0.0]
Predictor-Corrector Temporal Difference (PCTD) is what I call a Reinforcement Learning (RL) algorithm translated from the theory of discrete-time ODEs.
I propose a new class of TD learning algorithms.
The approximated parameter has a guaranteed order-of-magnitude reduction in the Taylor series error of the solution to the ODE.
arXiv Detail & Related papers (2021-04-15T18:54:16Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Adaptive Temporal Difference Learning with Linear Function Approximation [29.741034258674205]
This paper revisits the temporal difference (TD) learning algorithm for the policy evaluation tasks in reinforcement learning.
We develop a provably convergent adaptive projected variant of the TD(0) learning algorithm with linear function approximation.
We evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard reinforcement learning tasks.
arXiv Detail & Related papers (2020-02-20T02:32:40Z)
- Reanalysis of Variance Reduced Temporal Difference Learning [57.150444843282]
A variance reduced TD (VRTD) algorithm was proposed by Korda and La, which applies the variance reduction technique directly to online TD learning with Markovian samples.
We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate.
arXiv Detail & Related papers (2020-01-07T05:32:43Z)
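As a rough illustration of the ATC idea mentioned in the related-papers entry above, here is a small sketch under assumptions not stated in that summary: confidence is taken to be the max-softmax probability, and the threshold is chosen so that the fraction of source examples above it matches source accuracy. The function name and array layout are hypothetical.
```python
import numpy as np

def atc_predict_accuracy(src_probs, src_labels, tgt_probs):
    """Sketch of Average Thresholded Confidence (ATC): learn a confidence
    threshold on labeled source data, then predict target accuracy as the
    fraction of unlabeled target examples whose confidence exceeds it."""
    src_conf = src_probs.max(axis=1)                 # max-softmax confidence
    src_acc = (src_probs.argmax(axis=1) == src_labels).mean()
    t = np.quantile(src_conf, 1.0 - src_acc)         # P_src(conf > t) ~= src_acc
    return (tgt_probs.max(axis=1) > t).mean()        # predicted target accuracy
```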
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.