Stabilizing Temporal Difference Learning via Implicit Stochastic Recursion
- URL: http://arxiv.org/abs/2505.01361v2
- Date: Sun, 22 Jun 2025 22:31:04 GMT
- Title: Stabilizing Temporal Difference Learning via Implicit Stochastic Recursion
- Authors: Hwanwoo Kim, Panos Toulis, Eric Laber
- Abstract summary: Temporal difference (TD) learning is a foundational algorithm in reinforcement learning (RL). We propose implicit TD algorithms that reformulate TD updates into fixed point equations. Our results show that implicit TD algorithms are applicable to a much broader range of step sizes.
- Score: 2.1301560294088318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal difference (TD) learning is a foundational algorithm in reinforcement learning (RL). For nearly forty years, TD learning has served as a workhorse for applied RL as well as a building block for more complex and specialized algorithms. However, despite its widespread use, TD procedures are generally sensitive to step size specification. A poor choice of step size can dramatically increase variance and slow convergence in both on-policy and off-policy evaluation tasks. In practice, researchers use trial and error to identify stable step sizes, but these approaches tend to be ad hoc and inefficient. As an alternative, we propose implicit TD algorithms that reformulate TD updates into fixed point equations. Such updates are more stable and less sensitive to step size without sacrificing computational efficiency. Moreover, we derive asymptotic convergence guarantees and finite-time error bounds for our proposed implicit TD algorithms, which include implicit TD(0), TD($\lambda$), and TD with gradient correction (TDC). Our results show that implicit TD algorithms are applicable to a much broader range of step sizes, and thus provide a robust and versatile framework for policy evaluation and value approximation in modern RL tasks. We demonstrate these benefits empirically through extensive numerical examples spanning both on-policy and off-policy tasks.
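For intuition, here is a minimal sketch of what an implicit TD(0) update can look like under linear function approximation. This is not the paper's reference implementation: the closed-form shrinkage factor below is what one obtains by solving the fixed point equation $\theta_{t+1} = \theta_t + \alpha\,(r_t + \gamma\,\phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_{t+1})\,\phi(s_t)$ for $\theta_{t+1}$, and any projection steps or trace variants in the published algorithms are omitted.

```python
import numpy as np

def implicit_td0_step(theta, phi_s, phi_next, reward, gamma, alpha):
    """One implicit TD(0) step with linear value approximation (sketch).

    Solving the implicit (fixed point) update
        theta_new = theta + alpha * (r + gamma * phi_next @ theta
                                     - phi_s @ theta_new) * phi_s
    for theta_new yields the ordinary TD(0) direction scaled by
    1 / (1 + alpha * ||phi_s||^2).
    """
    td_error = reward + gamma * phi_next @ theta - phi_s @ theta
    shrink = 1.0 / (1.0 + alpha * phi_s @ phi_s)  # implicit step-size damping
    return theta + alpha * shrink * td_error * phi_s
```

The damping factor is the source of the step-size robustness: for small $\alpha$ the step matches ordinary TD(0), while for large $\alpha$ the effective step size is capped near $1/\|\phi(s)\|^2$, so a poorly chosen step size cannot blow the iterates up.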
Related papers
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can outperform the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Backstepping Temporal Difference Learning [6.663174194579773]
We propose a new convergent algorithm for off-policy TD-learning. Our method relies on the backstepping technique, which is widely used in nonlinear control theory. Convergence of the proposed algorithm is experimentally verified in environments where standard TD-learning is known to be unstable.
arXiv Detail & Related papers (2023-02-20T10:06:49Z)
- Efficient Meta-Learning for Continual Learning with Taylor Expansion Approximation [2.28438857884398]
Continual learning aims to alleviate catastrophic forgetting when handling consecutive tasks under non-stationary distributions.
We propose a novel efficient meta-learning algorithm for solving the online continual learning problem.
Our method achieves better or on-par performance and much higher efficiency compared to the state-of-the-art approaches.
arXiv Detail & Related papers (2022-10-03T04:57:05Z)
- Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on a recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of ranked range losses, which likewise lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method [49.93717224277131]
We propose a new ETD method, called PER-ETD (i.e., PEriodically Restarted-ETD), which restarts and updates the follow-on trace only for a finite period.
We show that PER-ETD converges to the same desirable fixed point as ETD, but improves the exponential sample complexity to be polynomial.
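To make the restart idea concrete, here is a hedged sketch: the usual ETD follow-on recursion $F_t = i(S_t) + \gamma \rho_{t-1} F_{t-1}$ is reset every `period` steps so that products of importance ratios cannot grow without bound. The variable names, the constant interest weighting, and the exact restart schedule are my assumptions, not details taken from the paper.

```python
def per_etd_follow_on(rhos, gammas, period, interest=1.0):
    """Follow-on trace with periodic restarts (sketch of the PER-ETD idea)."""
    F, traces = interest, []
    for t, (rho_prev, gamma) in enumerate(zip(rhos, gammas)):
        # Restart every `period` steps; otherwise use the standard ETD recursion.
        F = interest if t % period == 0 else interest + gamma * rho_prev * F
        traces.append(F)
    return traces
```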
arXiv Detail & Related papers (2021-10-13T17:40:12Z)
- A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes, with complexity independent of the number of states.
We give an efficient algorithm that is capable of improving the accuracy of weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z)
- Emphatic Algorithms for Deep Reinforcement Learning [43.17171330951343]
Temporal difference learning algorithms can become unstable when combined with function approximation and off-policy sampling.
The emphatic temporal difference algorithm ETD($\lambda$) ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates.
We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward-view multi-step returns, results in poor performance.
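For background (this is context for the entry above, not code from it): the standard ETD($\lambda$) recursions with linear features, following Sutton, Mahmood and White's formulation, can be sketched as below; variable names are mine.

```python
def etd_lambda_step(theta, e, F, phi, phi_next, reward,
                    gamma, lam, alpha, rho, rho_prev, interest=1.0):
    """One ETD(lambda) update with linear features (background sketch).

    F is the follow-on trace, M the emphasis, e the eligibility trace;
    rho and rho_prev are importance sampling ratios.
    """
    F = interest + gamma * rho_prev * F        # follow-on trace
    M = lam * interest + (1.0 - lam) * F       # emphasis weight
    e = rho * (gamma * lam * e + M * phi)      # emphatic eligibility trace
    td_error = reward + gamma * phi_next @ theta - phi @ theta
    return theta + alpha * td_error * e, e, F
```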
arXiv Detail & Related papers (2021-06-21T12:11:39Z)
- Parameter-free Gradient Temporal Difference Learning [3.553493344868414]
We develop gradient-based temporal difference algorithms for reinforcement learning.
Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to $\log$ factors.
Our experiments demonstrate that our methods maintain high prediction performance relative to fully-tuned baselines, with no tuning whatsoever.
arXiv Detail & Related papers (2021-05-10T06:07:05Z)
- Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization [85.84019017587477]
Distributionally robust supervised learning is emerging as a key paradigm for building reliable machine learning systems for real-world applications.
Existing algorithms for solving Wasserstein DRSL involve solving complex subproblems or fail to make use of gradients.
We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable extra-gradient algorithms.
arXiv Detail & Related papers (2021-04-27T16:56:09Z)
- Predictor-Corrector (PC) Temporal Difference (TD) Learning (PCTD) [0.0]
Predictor-Corrector Temporal Difference (PCTD) is what I call the Reinforcement Learning (RL) algorithm translated from the theory of discrete-time ODEs.
I propose a new class of TD learning algorithms.
The parameter being approximated has a guaranteed order of magnitude reduction in the Taylor Series error of the solution to the ODE.
arXiv Detail & Related papers (2021-04-15T18:54:16Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning [9.359939442911127]
This paper focuses on stochastic variational inequalities (VIs) under Markovian noise.
A prominent application of our algorithmic developments is the policy evaluation problem in reinforcement learning.
arXiv Detail & Related papers (2020-11-15T04:05:22Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Reanalysis of Variance Reduced Temporal Difference Learning [57.150444843282]
A variance reduced TD (VRTD) algorithm was proposed by Korda and La, which applies the variance reduction technique directly to online TD learning with Markovian samples.
We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate.
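For context, the variance reduction here follows the SVRG pattern applied to the semi-gradient TD(0) update direction. A minimal sketch under that assumption (the batch handling, step sizes, and sampling details of the actual VRTD algorithm differ):

```python
import numpy as np

def td_direction(theta, phi, phi_next, reward, gamma):
    """Semi-gradient TD(0) update direction for a single transition."""
    return (reward + gamma * phi_next @ theta - phi @ theta) * phi

def vrtd_epoch(theta, batch, alpha, gamma, rng):
    """One variance-reduced epoch in the SVRG style (sketch, not the authors' code).

    `batch` is a list of (phi, phi_next, reward) transitions; recentering each
    noisy update with the anchor's batch-averaged direction reduces variance.
    """
    anchor = theta.copy()
    g_bar = np.mean([td_direction(anchor, *tr, gamma) for tr in batch], axis=0)
    for _ in range(len(batch)):
        phi, phi_next, reward = batch[rng.integers(len(batch))]
        theta = theta + alpha * (td_direction(theta, phi, phi_next, reward, gamma)
                                 - td_direction(anchor, phi, phi_next, reward, gamma)
                                 + g_bar)
    return theta
```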
arXiv Detail & Related papers (2020-01-07T05:32:43Z)