Related papers: Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes

Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes

URL: http://arxiv.org/abs/2602.16629v1
Date: Wed, 18 Feb 2026 17:24:27 GMT
Title: Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
Authors: Ethan Blaser, Jiuqi Wang, Shangtong Zhang,
Abstract summary: Differential temporal difference (TD) learning algorithms are a major advance for average reward RL.<n>Existing convergence guarantees require a local clock in learning rates tied to state visit counts.<n>We prove the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock.
Score: 19.67390261007849
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.

Related papers

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs [19.079556051442168]
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks.<n>But for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly noisy.<n>We propose a stabilization method for REINFORCE/ GRPO-style algorithms that scales learning rate based on effective sample size to dampen unreliable updates.
arXiv Detail & Related papers (2026-02-19T18:40:51Z)
Transitive RL: Value Learning via Divide and Conquer [54.190627631246166]
Transitive Reinforcement Learning (TRL) is a new value learning algorithm based on a divide-and-conquer paradigm.<n>Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming.
arXiv Detail & Related papers (2025-10-26T03:32:31Z)
Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation [5.638124543342179]
We present the first robust temporal-difference learning with linear function approximation.<n>Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts.
arXiv Detail & Related papers (2025-10-02T07:01:41Z)
Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL)<n>We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms.<n>We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
Relative Entropy Pathwise Policy Optimization [66.03329137921949]
We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories.<n>We show how to combine policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z)
Stabilizing Temporal Difference Learning via Implicit Stochastic Recursion [2.1301560294088318]
Temporal difference (TD) learning is a foundational algorithm in reinforcement learning (RL)<n>We propose implicit TD algorithms that reformulate TD updates into fixed point equations.<n>Our results show that implicit TD algorithms are applicable to a much broader range of step sizes.
arXiv Detail & Related papers (2025-05-02T15:57:54Z)
Uncertainty quantification for Markov chain induced martingales with application to temporal difference learning [55.197497603087065]
We analyze the performance of the Temporal Difference (TD) learning algorithm with linear function approximations.<n>We establish novel and general high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains.
arXiv Detail & Related papers (2025-02-19T15:33:55Z)
Time-Scale Separation in Q-Learning: Extending TD($\ riangle$) for Action-Value Function Decomposition [0.0]
This paper introduces Q($Delta$)-Learning, an extension of TD($Delta$) for the Q-Learning framework. TD($Delta$) facilitates efficient learning over several time scales by breaking the Q($Delta$)-function into distinct discount factors. We demonstrate through theoretical analysis and practical evaluations on standard benchmarks like Atari that Q($Delta$)-Learning surpasses conventional Q-Learning and TD learning methods.
arXiv Detail & Related papers (2024-11-21T11:03:07Z)
Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples. However, IS is employed in RL as a passive tool for re-weighting historical samples. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
DPO: A Differential and Pointwise Control Approach to Reinforcement Learning [3.2857981869020327]
Reinforcement learning (RL) in continuous state-action spaces remains challenging in scientific computing.<n>We introduce Differential Reinforcement Learning (Differential RL), a novel framework that reformulates RL from a continuous-time control perspective.<n>We develop Differential Policy Optimization (DPO), a pointwise, stage-wise algorithm that refines local movement operators.
arXiv Detail & Related papers (2024-04-24T03:11:12Z)
Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic [61.968469104271676]
We propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.
arXiv Detail & Related papers (2023-01-28T04:12:56Z)
Parameter-free Gradient Temporal Difference Learning [3.553493344868414]
We develop gradient-based temporal difference algorithms for reinforcement learning. Our algorithms run in linear time and achieve high-probability convergence guarantees matching those of GTD2 up to $log$ factors. Our experiments demonstrate that our methods maintain high prediction performance relative to fully-tuned baselines, with no tuning whatsoever.
arXiv Detail & Related papers (2021-05-10T06:07:05Z)
Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP. DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize. We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.