Direct Gradient Temporal Difference Learning
- URL: http://arxiv.org/abs/2308.01170v1
- Date: Wed, 2 Aug 2023 14:16:22 GMT
- Title: Direct Gradient Temporal Difference Learning
- Authors: Xiaochi Qian, Shangtong Zhang
- Abstract summary: Off-policy learning enables a reinforcement learning agent to reason counterfactually about policies that are not executed.
It can lead to instability when combined with function approximation and bootstrapping.
In this paper, we propose a direct method to solve the double sampling issue by simply using two samples in a Markovian data stream with an increasing gap.
- Score: 23.297137490591382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning enables a reinforcement learning (RL) agent to reason
counterfactually about policies that are not executed and is one of the most
important ideas in RL. It, however, can lead to instability when combined with
function approximation and bootstrapping, two arguably indispensable
ingredients for large-scale reinforcement learning. This is the notorious
deadly triad. Gradient Temporal Difference (GTD) is one powerful tool to solve
the deadly triad. Its success results from solving a double sampling issue
indirectly with weight duplication or Fenchel duality. In this paper, we
instead propose a direct method to solve the double sampling issue by simply
using two samples in a Markovian data stream with an increasing gap. The
resulting algorithm is as computationally efficient as GTD but gets rid of
GTD's extra weights. The only price we pay is a logarithmically increasing
memory as time progresses. We provide both asymptotic and finite sample
analysis, where the convergence rate is on-par with the canonical on-policy
temporal difference learning. Key to our analysis is a novel refined
discretization of limiting ODEs.
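To make the double sampling issue concrete: with linear features $\phi$ and TD error $\delta$, an NEU-style objective $\tfrac{1}{2}\|\mathbb{E}[\delta\phi]\|^2$ has gradient $\mathbb{E}[(\gamma\phi'-\phi)\phi^\top]\,\mathbb{E}[\delta\phi]$, a product of two expectations that cannot be estimated without bias from a single transition. The sketch below only illustrates the direct idea stated in the abstract, not the authors' algorithm: each factor is estimated from a different transition of the same Markovian stream, separated by a gap that grows with time, and only transitions at power-of-two indices are retained so that memory grows logarithmically. The function name, transition format, and retention rule are assumptions made for the example.
```python
import numpy as np

def direct_gtd_sketch(stream, num_features, alpha=0.01, gamma=0.95):
    """Illustrative sketch of a two-sample gradient estimate for an
    NEU-style objective with linear features.  `stream` is assumed to
    yield (phi, reward, phi_next) transitions from a Markov chain."""
    w = np.zeros(num_features)
    kept = []                                   # O(log t) retained transitions
    for t, (phi, r, phi_next) in enumerate(stream, start=1):
        if t & (t - 1) == 0:                    # keep power-of-two-indexed transitions
            kept.append((phi, r, phi_next))
        if len(kept) < 2:
            continue
        # factor 1 (current transition): estimates E[delta * phi]
        delta = r + gamma * phi_next @ w - phi @ w
        g1 = delta * phi
        # factor 2 (transition roughly t/2 steps in the past, so nearly independent):
        # estimates E[(gamma * phi' - phi) phi^T]
        phi2, _, phi2_next = kept[-2]
        jac = np.outer(gamma * phi2_next - phi2, phi2)
        w -= alpha * (jac @ g1)                 # stochastic gradient step
    return w
```
Because the gap between the two transitions keeps growing, their correlation decays, which is what makes the product an asymptotically unbiased gradient estimate without the extra weight vector of GTD.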
Related papers
- Deep Reinforcement Learning with Gradient Eligibility Traces [25.47053572017618]
We propose three gradient-based methods to achieve fast and stable off-policy learning in deep reinforcement learning.
We provide a forward-view formulation compatible with experience replay and a backward-view formulation compatible with streaming algorithms.
We evaluate the proposed algorithms and show that they outperform both PPO and StreamQ in MuJoCo and MinAtar environments.
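For background on what a "backward-view formulation" refers to, here is a standard linear TD($\lambda$) accumulating-trace update; this is textbook material, not the paper's gradient-corrected traces, and the transition format is an assumption.
```python
import numpy as np

def td_lambda_backward_view(stream, num_features, alpha=0.1, gamma=0.99, lam=0.9):
    """Standard linear TD(lambda) with a backward-view eligibility trace."""
    w = np.zeros(num_features)
    e = np.zeros(num_features)               # eligibility trace
    for phi, r, phi_next, done in stream:    # assumed transition format
        delta = r + (0.0 if done else gamma * phi_next @ w) - phi @ w
        e = gamma * lam * e + phi            # accumulate credit for recent features
        w += alpha * delta * e
        if done:
            e = np.zeros(num_features)
    return w
```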
arXiv Detail & Related papers (2025-07-12T00:12:05Z) - Actor-Critics Can Achieve Optimal Sample Efficiency [15.033410073144939]
We introduce a novel actor-critic algorithm that attains a sample-complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2)$ trajectories.
We extend this to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL.
arXiv Detail & Related papers (2025-05-06T17:32:39Z) - Accelerating Multi-Task Temporal Difference Learning under Low-Rank Representation [12.732028509861829]
We study policy evaluation problems in multi-task reinforcement learning (RL) under a low-rank representation setting.
We propose a new variant of TD learning method, where we integrate the so-called truncated singular value decomposition step into the update of TD learning.
Our empirical results show that the proposed method significantly outperforms the classic TD learning, where the performance gap increases as the rank $r$ decreases.
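As an illustration of how a truncated SVD step could be interleaved with TD updates under a low-rank multi-task representation, here is a hedged sketch; the paper's exact update, step sizes, and projection schedule may differ, and the data format and function name are assumptions.
```python
import numpy as np

def multitask_td_truncated_svd(task_streams, num_features, rank,
                               alpha=0.05, gamma=0.95, num_steps=1000):
    """Per-task linear TD(0) on a shared weight matrix W (features x tasks),
    followed by a truncated-SVD projection of W onto rank `rank`.
    Each stream is assumed to be an endless generator of (phi, r, phi_next)."""
    num_tasks = len(task_streams)
    W = np.zeros((num_features, num_tasks))
    for _ in range(num_steps):
        for k, stream in enumerate(task_streams):
            phi, r, phi_next = next(stream)
            delta = r + gamma * phi_next @ W[:, k] - phi @ W[:, k]
            W[:, k] += alpha * delta * phi              # ordinary TD(0) step for task k
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        W = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]     # low-rank projection
    return W
```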
arXiv Detail & Related papers (2025-03-03T20:07:45Z) - Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs).
We study a model within this domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose an algorithm namely robust contextual dueling bandits (RCDB), which is based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z) - Statistical Efficiency of Distributional Temporal Difference Learning [24.03281329962804]
We analyze the finite-sample performance of categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD).
For a $\gamma$-discounted infinite-horizon decision process, we show that for NTD we need $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p}}\right)$ to achieve an $\varepsilon$-optimal estimator with high probability.
We establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest.
arXiv Detail & Related papers (2024-03-09T06:19:53Z) - Function Value Learning: Adaptive Learning Rates Based on the Polyak
Stepsize and Function Splitting in ERM [6.542289202349586]
We focus on solving a finite sum-of-terms problem, also known as empirical risk minimization.
We first detail an idealized adaptive method called $\texttt{SPS}_+$ that makes use of the sampled loss values.
We then move on to develop $\texttt{FUVAL}$, a variant of $\texttt{SPS}_+$ where the loss values at optimality are gradually learned.
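For orientation, the stochastic Polyak stepsize that $\texttt{SPS}_+$ builds on sets the step to the positive part of the sampled optimality gap divided by the squared gradient norm. The sketch below shows that generic rule under the assumption that per-sample optimal loss values are known; $\texttt{FUVAL}$'s contribution is precisely to learn those values, which is not implemented here, and all names are illustrative.
```python
import numpy as np

def polyak_stepsize_sgd(grad_f, loss_f, loss_star, x0, data,
                        c=1.0, eps=1e-12, num_steps=1000, seed=0):
    """SGD with a stochastic Polyak-type stepsize on a finite sum (ERM).
    `loss_star[i]` is the assumed-known optimal value of the i-th term."""
    x = np.array(x0, dtype=float)
    rng = np.random.default_rng(seed)
    for _ in range(num_steps):
        i = rng.integers(len(data))
        g = grad_f(x, data[i])
        gap = max(loss_f(x, data[i]) - loss_star[i], 0.0)   # positive part of the gap
        x -= (gap / (c * (g @ g) + eps)) * g
    return x
```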
arXiv Detail & Related papers (2023-07-26T22:12:31Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2|\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z) - Near-Optimal Adversarial Reinforcement Learning with Switching Costs [43.895798638743784]
We show how to develop a provably efficient algorithm for adversarial RL with switching costs.
Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best regret achievable without switching costs is no longer attainable.
We propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known.
arXiv Detail & Related papers (2023-02-08T23:41:29Z) - Tractable Optimality in Episodic Latent MABs [75.17357040707347]
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to estimate accurately the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
arXiv Detail & Related papers (2022-10-05T22:53:46Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
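The least-squares reward-redistribution step can be illustrated as a ridge regression of trajectory returns on summed per-step features, so that the fitted per-step proxies add up to the observed return. This is a sketch of that idea only; PARTED's pessimism bonuses and exact estimator are omitted, and the data format is assumed.
```python
import numpy as np

def least_squares_reward_redistribution(trajectories, returns, lam=1.0):
    """Fit per-step proxy rewards r_hat(s,a) = phi(s,a)^T theta so that the
    sum of proxies along each trajectory matches its trajectory-wise return.
    `trajectories`: list of (T_i x d) feature arrays; `returns`: list of scalars."""
    d = trajectories[0].shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for feats, ret in zip(trajectories, returns):
        z = feats.sum(axis=0)                   # summed features of the trajectory
        A += np.outer(z, z)
        b += ret * z
    theta = np.linalg.solve(A + lam * np.eye(d), b)   # ridge least squares
    proxy_rewards = [feats @ theta for feats in trajectories]
    return theta, proxy_rewards
```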
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Fast Rates for Nonparametric Online Learning: From Realizability to
Learning in Games [36.969021834291745]
We propose a proper learning algorithm which gets a near-optimal mistake bound in terms of the sequential fat-shattering dimension of the hypothesis class.
This result answers a question as to whether proper learners could achieve near-optimal mistake bounds.
For the real-valued (regression) setting, the optimal mistake bound was not even known for improper learners.
arXiv Detail & Related papers (2021-11-17T05:24:21Z) - Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
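In the spirit of RLSVI, exploration can be induced by adding i.i.d. scalar noise to the regression targets of a least-squares value fit. The sketch below shows that flavor in a stripped-down, single-action setting; it is not the paper's algorithm, and all array shapes and names are assumptions.
```python
import numpy as np

def perturbed_lsvi_sketch(phi, rewards, phi_next, sigma=1.0, lam=1.0,
                          gamma=0.99, iters=50, seed=0):
    """Least-squares value iteration where each regression target receives an
    i.i.d. Gaussian perturbation, randomizing the fitted values.
    phi: (n x d) features, rewards: (n,), phi_next: (n x d) next-state features."""
    n, d = phi.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    A_inv = np.linalg.inv(phi.T @ phi + lam * np.eye(d))
    for _ in range(iters):
        targets = rewards + gamma * (phi_next @ w)
        noisy = targets + sigma * rng.standard_normal(n)   # i.i.d. scalar noise
        w = A_inv @ (phi.T @ noisy)                        # perturbed least-squares fit
    return w
```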
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - An Analysis of Frame-skipping in Reinforcement Learning [13.680685626360903]
On many Atari console games, reinforcement learning algorithms deliver substantially better policies when run with a frame-skip parameter $d > 1$.
We focus on "action-repetition", the common restriction of this choice to $d$-length sequences of the same action.
We show that this loss may be offset by the gain brought to learning by a smaller task horizon.
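Action repetition itself is simple to state in code: each decision of the agent is executed for $d$ consecutive environment steps, shortening the effective task horizon by a factor of $d$. The wrapper below assumes a minimal reset/step protocol returning (obs, reward, done), not any particular library's API.
```python
class ActionRepeat:
    """Repeat each chosen action for d consecutive environment steps,
    accumulating the reward along the way."""
    def __init__(self, env, d):
        self.env, self.d = env, d

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.d):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done
```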
arXiv Detail & Related papers (2021-02-07T04:59:09Z) - Provably Efficient Reinforcement Learning with Linear Function
Approximation Under Adaptivity Constraints [94.76881135901753]
We consider two popular limited adaptivity models: batch learning model and rare policy switch model.
Our proposed LSVI-UCB-Batch algorithm achieves an $\tilde{O}(\sqrt{d^3H^3T} + dHT/B)$ regret.
For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde{O}(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret.
arXiv Detail & Related papers (2021-01-06T18:56:07Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
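The dual side of such a primal-dual method is a projected ascent step on the Lagrange multiplier of the constraint. A generic version (not the paper's exact update) is sketched below; the step size and the cost/budget quantities are assumed inputs.
```python
def dual_update(lmbda, est_constraint_cost, budget, eta):
    """Projected dual ascent: raise the Lagrange multiplier when the
    estimated constraint cost exceeds the budget, keeping it nonnegative."""
    return max(lmbda + eta * (est_constraint_cost - budget), 0.0)
```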
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.