Regularized Gradient Temporal-Difference Learning
- URL: http://arxiv.org/abs/2601.20599v1
- Date: Wed, 28 Jan 2026 13:37:42 GMT
- Title: Regularized Gradient Temporal-Difference Learning
- Authors: Hyunjun Na, Donghwan Lee,
- Abstract summary: Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation.<n>We propose a regularized optimization objective by reformulating the mean-square projected Bellman error (MSPBE) minimization.<n>This formulation naturally yields a regularized GTD algorithms, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular.
- Score: 6.622208195193136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular and leads to instability or degraded performance. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error (MSPBE) minimization. This formulation naturally yields a regularized GTD algorithms, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.
Related papers
- Nonconvex Regularization for Feature Selection in Reinforcement Learning [7.408148824204063]
This work proposes an efficient batch algorithm for feature selection in reinforcement learning (RL) with theoretical convergence guarantees.<n> Numerical experiments demonstrate that the proposed approach substantially outperforms state-selection scenarios.
arXiv Detail & Related papers (2025-09-19T06:21:20Z) - An Accelerated Alternating Partial Bregman Algorithm for ReLU-based Matrix Decomposition [0.0]
In this paper, we aim to investigate the sparse low-rank characteristics rectified on non-negative matrices.<n>We propose a novel regularization term incorporating useful structures in clustering and compression tasks.<n>We derive corresponding closed-form solutions while maintaining the $L$-smooth property always holds for any $Lge 1$.
arXiv Detail & Related papers (2025-03-04T08:20:34Z) - Mirror Descent Actor Critic via Bounded Advantage Learning [0.0]
Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates.<n>We propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains.
arXiv Detail & Related papers (2025-02-06T08:14:03Z) - Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [69.1820058966619]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.<n>We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z) - Statistical Inference for Temporal Difference Learning with Linear Function Approximation [55.80276145563105]
We investigate the statistical properties of Temporal Difference learning with Polyak-Ruppert averaging.<n>We make three theoretical contributions that improve upon the current state-of-the-art results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - Strongly-polynomial time and validation analysis of policy gradient methods [3.722665817361884]
This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDP) and reinforcement learning (RL)<n>By incorporating this advantage gap function into the design of step size rules, we deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy.<n>This is the first time that such strong convergence properties have been established for policy gradient methods.
arXiv Detail & Related papers (2024-09-28T18:56:48Z) - Asymptotically Unbiased Instance-wise Regularized Partial AUC
Optimization: Theory and Algorithm [101.44676036551537]
One-way Partial AUC (OPAUC) and Two-way Partial AUC (TPAUC) measures the average performance of a binary classifier.
Most of the existing methods could only optimize PAUC approximately, leading to inevitable biases that are not controllable.
We present a simpler reformulation of the PAUC problem via distributional robust optimization AUC.
arXiv Detail & Related papers (2022-10-08T08:26:22Z) - Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z) - Bellman Residual Orthogonalization for Offline Reinforcement Learning [53.17258888552998]
We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class.
arXiv Detail & Related papers (2022-03-24T01:04:17Z) - Instance-Dependent Confidence and Early Stopping for Reinforcement
Learning [99.57168572237421]
Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure.
This research provides guarantees that explain textitex post the performance differences observed.
A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice.
arXiv Detail & Related papers (2022-01-21T04:25:35Z) - Accelerated and instance-optimal policy evaluation with linear function
approximation [17.995515643150657]
Existing algorithms fail to match at least one of these lower bounds.
We develop an accelerated, variance-reduced fast temporal difference algorithm that simultaneously matches both lower bounds and attains a strong notion of instance-optimality.
arXiv Detail & Related papers (2021-12-24T17:21:04Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.