Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis
- URL: http://arxiv.org/abs/2010.13272v4
- Date: Mon, 22 May 2023 15:12:08 GMT
- Title: Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis
- Authors: Shaocong Ma, Yi Zhou, Shaofeng Zou
- Abstract summary: We develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting.
Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller convergence error than both the conventional TDC and the variance-reduced TD.
- Score: 27.679514676804057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Variance reduction techniques have been successfully applied to
temporal-difference (TD) learning and help to improve the sample complexity in
policy evaluation. However, the existing work applied variance reduction to
either the less popular one time-scale TD algorithm or the two time-scale GTD
algorithm but with a finite number of i.i.d. samples, and both algorithms
apply to only the on-policy setting. In this work, we develop a variance
reduction scheme for the two time-scale TDC algorithm in the off-policy setting
and analyze its non-asymptotic convergence rate over both i.i.d. and Markovian
samples. In the i.i.d. setting, our algorithm matches the best-known lower
bound $\tilde{O}(\epsilon^{-1})$. In the Markovian setting, our algorithm
achieves the state-of-the-art sample complexity $O(\epsilon^{-1} \log
\epsilon^{-1})$ that is near-optimal. Experiments demonstrate that the
proposed variance-reduced TDC achieves a smaller asymptotic convergence error
than both the conventional TDC and the variance-reduced TD.
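To make the combination of a two time-scale TDC update with an SVRG-style variance-reduction correction concrete, the sketch below illustrates the general idea under linear function approximation. It is a minimal sketch rather than the authors' implementation: the function names, the transition format `(phi, phi_next, reward, rho)`, the step sizes, and the epoch schedule are illustrative assumptions; only the reference-point correction is meant to mirror the variance-reduction scheme described in the abstract.

```python
import numpy as np

def tdc_pseudo_grads(theta, w, phi, phi_next, reward, rho, gamma):
    """Per-sample TDC pseudo-gradients for the main variable theta and the
    auxiliary variable w under linear function approximation, with
    importance weight rho for the off-policy correction."""
    delta = reward + gamma * phi_next @ theta - phi @ theta        # TD error
    g_theta = rho * (delta * phi - gamma * (w @ phi) * phi_next)   # TDC update direction
    g_w = rho * (delta - w @ phi) * phi                            # auxiliary tracking direction
    return g_theta, g_w

def variance_reduced_tdc(batches, dim, gamma=0.95, alpha=0.05, beta=0.05, epochs=10):
    """SVRG-style variance-reduced TDC sketch: at each epoch, freeze a reference
    point, average the pseudo-gradients over the batch at that reference, and
    correct every stochastic update by the difference between the current and
    reference per-sample pseudo-gradients."""
    theta, w = np.zeros(dim), np.zeros(dim)
    for epoch in range(epochs):
        batch = batches[epoch % len(batches)]   # list of (phi, phi_next, reward, rho)
        theta_ref, w_ref = theta.copy(), w.copy()
        ref = [tdc_pseudo_grads(theta_ref, w_ref, *s, gamma) for s in batch]
        mean_theta = np.mean([g for g, _ in ref], axis=0)
        mean_w = np.mean([h for _, h in ref], axis=0)
        for (phi, phi_next, reward, rho), (g_ref, h_ref) in zip(batch, ref):
            g, h = tdc_pseudo_grads(theta, w, phi, phi_next, reward, rho, gamma)
            # Variance-reduced directions: stochastic term, minus its value at
            # the reference point, plus the batch mean at the reference point.
            theta = theta + alpha * (g - g_ref + mean_theta)
            w = w + beta * (h - h_ref + mean_w)
    return theta, w
```

Note that this sketch treats each batch as a fixed set of transitions; the paper's Markovian analysis additionally accounts for the correlation between consecutive samples drawn from the behavior policy, which the i.i.d.-style batching above does not capture.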
Related papers
- Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed.
arXiv Detail & Related papers (2024-06-12T05:49:53Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization using cubics (ARC) methods have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously allow inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Stochastic Dimension-reduced Second-order Methods for Policy
Optimization [11.19708535159457]
We propose several new second-order algorithms for policy optimization that only require gradient and Hessian-vector product in each iteration.
Specifically, we propose a dimension-reduced second-order method (DR-SOPO) which repeatedly solves a projected two-dimensional trust region subproblem.
We show that DR-SOPO obtains an $\mathcal{O}(\epsilon^{-3.5})$ complexity for reaching an approximate first-order stationary condition.
In addition, we present an enhanced algorithm (DVR-SOPO) which further improves this complexity.
arXiv Detail & Related papers (2023-01-28T12:09:58Z) - Multi-Agent Off-Policy TD Learning: Finite-Time Analysis with
Near-Optimal Sample Complexity and Communication Complexity [13.100926925535578]
We develop two decentralized TD with correction (TDC) algorithms for multi-agent off-policy TD learning.
Our algorithms preserve full privacy of the actions, policies and rewards of the agents, and adopt mini-batch sampling to reduce the sampling variance and communication frequency.
arXiv Detail & Related papers (2021-03-24T12:48:08Z) - Sample Complexity Bounds for Two Timescale Value-based Reinforcement
Learning Algorithms [65.09383385484007]
Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms.
We study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ algorithms.
arXiv Detail & Related papers (2020-11-10T11:36:30Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable, recursive averaging of past stochastic gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Non-asymptotic Convergence Analysis of Two Time-scale (Natural)
Actor-Critic Algorithms [58.57004511121862]
Actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies.
We show that two time-scale AC requires an overall sample complexity of order $\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point.
We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling.
arXiv Detail & Related papers (2020-05-07T15:42:31Z) - Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)