Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality
- URL: http://arxiv.org/abs/2102.11866v1
- Date: Tue, 23 Feb 2021 18:56:13 GMT
- Title: Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality
- Authors: Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang
- Abstract summary: We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
- Score: 131.45028999325797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Designing off-policy reinforcement learning algorithms is typically a very
challenging task, because a desirable iteration update often involves an
expectation over an on-policy distribution. Prior off-policy actor-critic (AC)
algorithms have introduced a new critic that uses the density ratio for
adjusting the distribution mismatch in order to stabilize the convergence, but
at the cost of potentially introducing high biases due to the estimation errors
of both the density ratio and value function. In this paper, we develop a
doubly robust off-policy AC (DR-Off-PAC) for discounted MDP, which can take
advantage of learned nuisance functions to reduce estimation errors. Moreover,
DR-Off-PAC adopts a single timescale structure, in which both actor and critics
are updated simultaneously with constant stepsize, and is thus more sample
efficient than prior algorithms that adopt either two timescale or nested-loop
structure. We study the finite-time convergence rate and characterize the
sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal
policy. We also show that the overall convergence of DR-Off-PAC is doubly
robust to the approximation errors that depend only on the expressive power of
approximation functions. To the best of our knowledge, our study establishes
the first overall sample complexity analysis for a single time-scale off-policy
AC algorithm.
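The single-timescale, density-ratio-corrected structure described above can be illustrated with a minimal tabular sketch, in which the critic, the ratio estimator, and the actor all move each step with one constant stepsize. This is a simplified toy (random MDP, uniform behavior policy, crude ratio update), not the authors' DR-Off-PAC estimator; all shapes and update rules here are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny random MDP (hypothetical toy problem, not from the paper).
n_s, n_a, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] -> next-state dist
R = rng.uniform(size=(n_s, n_a))                   # expected rewards

theta = np.zeros((n_s, n_a))   # actor: softmax policy parameters
V = np.zeros(n_s)              # critic: state-value estimates
w = np.ones(n_s)               # density-ratio estimate d_pi / d_behavior

alpha = 0.05                                       # one constant stepsize for all updates
behavior = np.full((n_s, n_a), 1.0 / n_a)          # fixed off-policy behavior

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for _ in range(20000):
    a = rng.choice(n_a, p=behavior[s])
    s2 = rng.choice(n_s, p=P[s, a])
    r = R[s, a]

    rho = pi(s)[a] / behavior[s, a]                # per-step importance ratio
    delta = r + gamma * V[s2] - V[s]               # TD error

    # Single timescale: critic, ratio estimator, and actor all move each step.
    V[s] += alpha * w[s] * rho * delta             # ratio-weighted critic update
    w[s2] += alpha * (gamma * rho * w[s] + (1 - gamma) - w[s2])  # crude ratio update
    grad = -pi(s)
    grad[a] += 1.0                                 # grad log pi(a|s)
    theta[s] += alpha * w[s] * rho * delta * grad  # corrected actor update

    s = s2
```

The point the sketch illustrates is the single-timescale structure: no inner loop waits for the critic or the ratio estimate to converge before the actor moves.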
Related papers
- A Single-Loop Deep Actor-Critic Algorithm for Constrained Reinforcement Learning with Provable Convergence [7.586600116278698]
The paper proposes a single-loop deep Actor-Critic algorithm for constrained reinforcement learning, combining the Actor-Critic framework with deep neural network (DNN) function approximation, and establishes provable convergence.
arXiv Detail & Related papers (2023-06-10T10:04:54Z) - Asymptotically Unbiased Instance-wise Regularized Partial AUC
Optimization: Theory and Algorithm [101.44676036551537]
One-way Partial AUC (OPAUC) and Two-way Partial AUC (TPAUC) measure the average performance of a binary classifier over restricted ranges of its ROC curve.
Most of the existing methods could only optimize PAUC approximately, leading to inevitable biases that are not controllable.
We present a simpler reformulation of the PAUC problem via distributionally robust optimization (DRO).
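For concreteness, the one-way variant restricts attention to the hardest negatives. A minimal empirical estimator can be sketched as follows; the function name and normalization are our assumptions, not the paper's formulation:

```python
import numpy as np

def one_way_pauc(scores_pos, scores_neg, beta):
    # Empirical OPAUC over FPR in [0, beta]: average, over positive
    # examples, the fraction of the top-beta quantile of negatives
    # (the "hardest" negatives) that each positive outranks.
    k = max(1, int(np.floor(beta * len(scores_neg))))
    top_neg = np.sort(scores_neg)[-k:]
    return (scores_pos[:, None] > top_neg[None, :]).mean()

# Perfectly separated toy scores: every positive outranks every negative.
pauc = one_way_pauc(np.array([2.0, 3.0]), np.array([0.0, 1.0]), beta=0.5)
```

The indicator-based estimator above is non-differentiable, which is exactly why the papers in this list study smoothed or DRO-based surrogate formulations.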
arXiv Detail & Related papers (2022-10-08T08:26:22Z) - Finite-Time Analysis of Fully Decentralized Single-Timescale
Actor-Critic [4.94128206910124]
We introduce a fully decentralized Actor-Critic (AC) algorithm, where actor, critic, and global reward estimator are updated in an alternating manner.
We show that our algorithm has sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ under Markovian sampling.
We also provide a local action privacy-preserving version of our algorithm and its analysis.
arXiv Detail & Related papers (2022-06-12T13:14:14Z) - When AUC meets DRO: Optimizing Partial AUC for Deep Learning with
Non-Convex Convergence Guarantee [51.527543027813344]
We propose systematic and efficient gradient-based methods for both one-way and two-way partial AUC (pAUC)
For both one-way and two-way pAUC, we propose two algorithms and prove their convergence for optimizing their two formulations, respectively.
arXiv Detail & Related papers (2022-03-01T01:59:53Z) - Accelerated and instance-optimal policy evaluation with linear function
approximation [17.995515643150657]
Existing algorithms fail to match at least one of the known lower bounds for policy evaluation.
We develop an accelerated, variance-reduced fast temporal difference algorithm that simultaneously matches both lower bounds and attains a strong notion of instance-optimality.
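The baseline that such accelerated, variance-reduced methods improve upon is plain TD(0) with linear function approximation, which can be sketched as follows (toy random chain; sizes and stepsize are arbitrary assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Markov chain under a fixed policy (hypothetical, for illustration).
n_s, d, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=n_s)     # transition matrix
r = rng.uniform(size=n_s)                     # expected rewards
Phi = rng.normal(size=(n_s, d))               # state features

theta = np.zeros(d)                           # value-function weights
alpha = 0.01                                  # constant stepsize
s = 0
for _ in range(20000):
    s2 = rng.choice(n_s, p=P[s])
    # Semi-gradient TD(0) update toward the bootstrapped target.
    delta = r[s] + gamma * Phi[s2] @ theta - Phi[s] @ theta
    theta += alpha * delta * Phi[s]
    s = s2
```

Plain TD(0) like this neither reuses past samples nor accelerates; the paper's contribution is an algorithm that additionally matches instance-dependent lower bounds.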
arXiv Detail & Related papers (2021-12-24T17:21:04Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
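Stripped to its essentials, the single-timescale primal-dual idea is simultaneous gradient descent on the primal variable and ascent on the dual variable with one constant stepsize. Here is a sketch on a toy convex-concave quadratic (illustrative only; the paper's TD objectives are nonconvex-concave):

```python
# Toy saddle problem:  min_x max_y  f(x, y) = 0.5*x**2 + x*y - 0.5*y**2,
# whose unique saddle point is (0, 0).

def grad_x(x, y):  # partial derivative of f with respect to x
    return x + y

def grad_y(x, y):  # partial derivative of f with respect to y
    return x - y

x, y = 2.0, -1.5
eta = 0.1                      # one constant stepsize for both sides
for _ in range(500):
    gx, gy = grad_x(x, y), grad_y(x, y)
    x -= eta * gx              # primal descent step
    y += eta * gy              # dual ascent step
# Both variables converge toward the saddle point (0, 0).
```

Each iteration touches one gradient per side, mirroring the "one data point each step" property highlighted above.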
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Approximation Algorithms for Sparse Principal Component Analysis [57.5357874512594]
Principal component analysis (PCA) is a widely used dimension reduction technique in machine learning and statistics.
Various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis.
We present thresholding as a provably accurate, polynomial-time approximation algorithm for the SPCA problem.
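The thresholding idea can be sketched simply: take the leading eigenvector of the sample covariance, keep only its k largest-magnitude loadings, and renormalize. This is a simplified illustration; the paper's algorithm and accuracy guarantees are more refined.

```python
import numpy as np

def spca_threshold(X, k):
    # Sparse PCA by thresholding (sketch): zero out all but the k
    # largest-magnitude entries of the leading principal direction.
    Sigma = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalue order
    v = eigvecs[:, -1]                        # leading principal direction
    keep = np.argsort(np.abs(v))[-k:]         # indices of k largest loadings
    v_sparse = np.zeros_like(v)
    v_sparse[keep] = v[keep]
    return v_sparse / np.linalg.norm(v_sparse)

# Data whose variance is concentrated in the first coordinates.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
X[:, 0] *= 5.0
X[:, 1] *= 3.0
v = spca_threshold(X, k=2)
```

The appeal of thresholding is that it costs little more than one eigendecomposition while yielding an exactly k-sparse direction.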
arXiv Detail & Related papers (2020-06-23T04:25:36Z) - Non-asymptotic Convergence Analysis of Two Time-scale (Natural)
Actor-Critic Algorithms [58.57004511121862]
Actor-critic (AC) and natural actor-critic (NAC) algorithms are widely used for finding optimal policies, and are often executed in a two time-scale manner.
We show that two time-scale AC requires an overall sample complexity of order $\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point.
We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling.
arXiv Detail & Related papers (2020-05-07T15:42:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.