Sample Complexity of Policy-Based Methods under Off-Policy Sampling and
Linear Function Approximation
- URL: http://arxiv.org/abs/2208.03247v1
- Date: Fri, 5 Aug 2022 15:59:05 GMT
- Title: Sample Complexity of Policy-Based Methods under Off-Policy Sampling and
Linear Function Approximation
- Authors: Zaiwei Chen and Siva Theja Maguluri
- Abstract summary: off-policy sampling and linear function approximation are employed for policy evaluation.
Various policy update rules, including natural policy gradient (NPG), are considered for policy update.
We establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy.
- Score: 8.465228064780748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we study policy-based methods for solving the reinforcement
learning problem, where off-policy sampling and linear function approximation
are employed for policy evaluation, and various policy update rules, including
natural policy gradient (NPG), are considered for policy update. To solve the
policy evaluation sub-problem in the presence of the deadly triad, we propose a
generic algorithm framework of multi-step TD-learning with generalized
importance sampling ratios, which includes two specific algorithms: the
$\lambda$-averaged $Q$-trace and the two-sided $Q$-trace. The generic algorithm
is single time-scale, has provable finite-sample guarantees, and overcomes the
high variance issue in off-policy learning.
As for the policy update, we provide a universal analysis using only the
contraction property and the monotonicity property of the Bellman operator to
establish the geometric convergence under various policy update rules.
Importantly, by viewing NPG as an approximate way of implementing policy
iteration, we establish the geometric convergence of NPG without introducing
regularization, and without using mirror descent type of analysis as in
existing literature. Combining the geometric convergence of the policy update
with the finite-sample analysis of the policy evaluation, we establish for the
first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for
finding an optimal policy (up to a function approximation error) using
policy-based methods under off-policy sampling and linear function
approximation.
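To make the policy-evaluation component concrete, the following is a minimal Python sketch of one multi-step off-policy TD update for a linear $Q$-function with a generalized importance sampling ratio. The `generalized_ratio` forms, parameter names, and step sizes are illustrative assumptions and do not reproduce the paper's exact $\lambda$-averaged or two-sided $Q$-trace definitions; the NPG policy-update step that alternates with this evaluation is omitted.

```python
import numpy as np

def generalized_ratio(rho, mode="two-sided", lo=0.5, hi=2.0, lam=0.5):
    """Illustrative generalized importance sampling ratios (hypothetical forms).

    rho = pi(a|s) / mu(a|s). The two branches are simplified stand-ins for the
    paper's two-sided Q-trace and lambda-averaged Q-trace ratios; see the paper
    for the exact definitions.
    """
    if mode == "two-sided":
        return float(np.clip(rho, lo, hi))   # clip from below and above
    return lam * rho + (1.0 - lam)           # convex combination with 1

def nstep_offpolicy_td_update(theta, trajectory, gamma=0.99, alpha=0.05):
    """One n-step off-policy TD update for a linear Q-function Q = theta^T phi.

    trajectory: list of tuples (phi_sa, reward, phi_next_expected, rho) where
      phi_sa            - features of the visited (state, action) pair,
      phi_next_expected - sum_a' pi(a'|s') phi(s', a') under the target policy,
      rho               - importance ratio pi(a|s) / mu(a|s) of the sampled action.
    """
    phi_0 = trajectory[0][0]
    correction, trace = 0.0, 1.0
    for k, (phi_sa, r, phi_next, rho) in enumerate(trajectory):
        if k > 0:
            trace *= generalized_ratio(rho)  # ratios only weight actions after the first
        td_error = r + gamma * phi_next @ theta - phi_sa @ theta
        correction += (gamma ** k) * trace * td_error
    # Semi-gradient step toward the multi-step target at the first (state, action) pair.
    return theta + alpha * correction * phi_0
```

A single learning rate for the whole update reflects the single-time-scale nature of the framework, while clipping (or averaging) the ratios is what keeps the variance of the off-policy corrections under control.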
Related papers
- Fast Policy Learning for Linear Quadratic Control with Entropy
Regularization [10.771650397337366]
This paper proposes and analyzes two new policy learning methods: regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems.
Assuming access to exact policy evaluation, both proposed approaches are proven to converge linearly to optimal policies of the regularized LQC.
arXiv Detail & Related papers (2023-11-23T19:08:39Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To our knowledge, this is the first non-asymptotic last-iterate convergence result for the policy iterates of single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
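For reference, the clipped surrogate of vanilla PPO, which each agent's local update here is said to resemble, can be sketched as below; the multi-agent coordination and the regularity conditions of the paper are not modeled.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Vanilla PPO clipped surrogate (to be maximized), per sample.

    ratio     - pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage - estimated advantage of the sampled action
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)
```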
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - On The Convergence Of Policy Iteration-Based Reinforcement Learning With
Monte Carlo Policy Evaluation [11.345796608258434]
We show that a first-visit version of such a policy iteration scheme converges to the optimal policy provided that the policy improvement step uses lookahead.
We also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.
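As a reminder of the evaluation half of such a scheme, here is a standard first-visit Monte Carlo estimate of $Q$ from complete episodes; the lookahead-based policy improvement step analyzed in the paper is not shown.

```python
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=0.99):
    """First-visit Monte Carlo estimate of Q(s, a) from complete episodes.

    episodes: list of episodes, each a list of (state, action, reward) tuples.
    Returns a dict mapping (state, action) -> averaged first-visit return.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g, tail_returns = 0.0, []
        for _, _, r in reversed(episode):          # backward pass: discounted return-to-go
            g = r + gamma * g
            tail_returns.append(g)
        tail_returns.reverse()
        seen = set()
        for (s, a, _), g_t in zip(episode, tail_returns):
            if (s, a) not in seen:                 # count only the first visit
                seen.add((s, a))
                returns[(s, a)].append(g_t)
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}
```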
arXiv Detail & Related papers (2023-01-23T20:32:41Z) - Linear Convergence of Natural Policy Gradient Methods with Log-Linear
Policies [115.86431674214282]
We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class.
We show that both methods attain linear convergence rates and $\mathcal{O}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size.
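For context, one standard way to write the NPG update for the log-linear class $\pi_\theta(a|s) \propto \exp(\theta^\top \phi(s,a))$ is via compatible function approximation; the exact sampling distribution, the Q-NPG variant, and the step-size schedule follow the paper:
$$\theta_{t+1} = \theta_t + \eta_t w_t, \qquad w_t \in \arg\min_w \mathbb{E}_{(s,a)}\big[\big(w^\top \bar{\phi}_t(s,a) - A^{\pi_{\theta_t}}(s,a)\big)^2\big],$$
where $\bar{\phi}_t(s,a) = \phi(s,a) - \mathbb{E}_{a' \sim \pi_{\theta_t}(\cdot|s)}[\phi(s,a')]$ are the centered features; Q-NPG instead regresses $Q^{\pi_{\theta_t}}$ on the uncentered features, and the geometrically increasing step size mentioned above refers to the choice of $\eta_t$.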
arXiv Detail & Related papers (2022-10-04T06:17:52Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best-known complexity $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best-known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
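The REINFORCE estimator mentioned above is the Monte Carlo form of the policy gradient; writing $G_t = \sum_{k \ge t} \gamma^{k-t} r_k$ for the sampled return-to-go,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t \ge 0} \gamma^t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big].$$
Variance-reduced policy gradient methods reuse past estimates of this gradient as control variates; the specific construction analyzed in the paper is not shown here.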
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
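For orientation, one classical member of the Gradient TD family is TDC in the discounted setting, sketched below with an off-policy importance weight; the paper's algorithms adapt this family to the average-reward setting, which replaces discounting with a reward-rate estimate and differs in the details.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho, gamma=0.99,
             alpha=0.01, beta=0.1):
    """One off-policy TDC (Gradient TD) update in the discounted setting.

    theta - main weights of the linear value estimate V = theta^T phi
    w     - auxiliary weights estimating the expected TD error given features
    rho   - importance sampling ratio pi(a|s) / mu(a|s)
    """
    td_error = reward + gamma * phi_next @ theta - phi @ theta
    theta = theta + alpha * rho * (td_error * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * (td_error - w @ phi) * phi
    return theta, w
```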
arXiv Detail & Related papers (2021-01-08T00:43:04Z)