Implicit Updates for Average-Reward Temporal Difference Learning
- URL: http://arxiv.org/abs/2510.06149v1
- Date: Tue, 07 Oct 2025 17:19:39 GMT
- Title: Implicit Updates for Average-Reward Temporal Difference Learning
- Authors: Hwanwoo Kim, Dongkyu Derek Cho, Eric Laber
- Abstract summary: Empirically, average-reward implicit TD($\lambda$) operates reliably over a much broader range of step-sizes. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($\lambda$).
- Score: 1.6440434996206623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($\lambda$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($\lambda$), which employs an implicit fixed point update to provide data-adaptive stabilization while preserving the per-iteration computational complexity of standard average-reward TD($\lambda$). In contrast to prior finite-time analyses of average-reward TD($\lambda$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($\lambda$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($\lambda$).
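To make the idea of an implicit fixed point update concrete, the following is a minimal sketch of one step of average-reward TD($\lambda$) with linear function approximation, where the current-state value term in the TD error is evaluated at the *updated* parameters. This is an illustrative reconstruction based on standard implicit-SGD reasoning, not the paper's exact algorithm; the function name, the closed-form solution, and the average-reward step-size `beta` are assumptions.

```python
import numpy as np

def implicit_avg_reward_td_lambda_step(theta, avg_reward, trace, phi, phi_next,
                                       reward, alpha, beta, lam):
    """One step of an implicit average-reward TD(lambda) update
    (illustrative sketch; the paper's exact update may differ).

    With linear features, requiring the current-state value term of the
    TD error to use the updated parameters,
        theta' = theta + alpha * (r - rho + phi_next@theta - phi@theta') * e,
    admits the closed form below: the standard update applied with a
    data-adaptive effective step-size alpha / (1 + alpha * phi@e),
    which shrinks automatically when the trace grows large.
    """
    trace = lam * trace + phi                       # accumulate eligibility trace
    td_error = reward - avg_reward + phi_next @ theta - phi @ theta
    shrink = 1.0 + alpha * (phi @ trace)            # implicit-step denominator
    theta = theta + (alpha / shrink) * td_error * trace
    avg_reward = avg_reward + beta * td_error       # average-reward estimate (one common variant)
    return theta, avg_reward, trace
```

Note how the denominator bounds the effective step-size regardless of how large `alpha` is chosen, which is one way to picture the broader stable step-size range the paper reports.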
Related papers
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - Moments Matter: Stabilizing Policy Optimization using Return Distributions [9.430246534202857]
In continuous control tasks, even small parameter shifts can produce unstable gaits. We propose an alternative that uses return distributions to counteract update-induced variability.
arXiv Detail & Related papers (2026-01-05T05:27:11Z) - Relative Entropy Pathwise Policy Optimization [66.03329137921949]
We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories. We show how to combine policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z) - Stabilizing Temporal Difference Learning via Implicit Stochastic Recursion [2.1301560294088318]
Temporal difference (TD) learning is a foundational algorithm in reinforcement learning (RL). We propose implicit TD algorithms that reformulate TD updates into fixed point equations. Our results show that implicit TD algorithms are applicable to a much broader range of step sizes.
arXiv Detail & Related papers (2025-05-02T15:57:54Z) - Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct Large Models to follow human intent, dataset by dataset. Existing gradient updates severely degrade performance on previous datasets during the CIT process. We propose a general continual instruction tuning framework to address this challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z) - A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP [1.0923877073891446]
We analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation. We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging. These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning.
arXiv Detail & Related papers (2024-06-12T05:49:53Z) - Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation [44.27439128304058]
We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging.
We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice.
arXiv Detail & Related papers (2022-10-12T04:37:54Z) - Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates [110.92598350897192]
Q-Learning has proven effective at learning a policy to perform control tasks.
Estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
arXiv Detail & Related papers (2021-10-28T00:07:19Z) - PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method [49.93717224277131]
We propose a new ETD method, called PER-ETD (i.e., PEriodically Restarted-ETD), which restarts and updates the follow-on trace only for a finite period.
We show that PER-ETD converges to the same desirable fixed point as ETD, but improves the exponential sample complexity to be polynomial.
arXiv Detail & Related papers (2021-10-13T17:40:12Z) - Simple and optimal methods for stochastic variational inequalities, II:
Markovian noise and policy evaluation in reinforcement learning [9.359939442911127]
This paper focuses on solving stochastic variational inequalities (VI) under Markovian noise.
A prominent application of our algorithmic developments is the policy evaluation problem in reinforcement learning.
arXiv Detail & Related papers (2020-11-15T04:05:22Z) - Adaptive Temporal Difference Learning with Linear Function Approximation [29.741034258674205]
This paper revisits the temporal difference (TD) learning algorithm for the policy evaluation tasks in reinforcement learning.
We develop a provably convergent adaptive projected variant of the TD(0) learning algorithm with linear function approximation.
We evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard reinforcement learning tasks.
arXiv Detail & Related papers (2020-02-20T02:32:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.