On The Convergence Of Policy Iteration-Based Reinforcement Learning With
Monte Carlo Policy Evaluation
- URL: http://arxiv.org/abs/2301.09709v1
- Date: Mon, 23 Jan 2023 20:32:41 GMT
- Authors: Anna Winnicki, R. Srikant
- Abstract summary: We show that a first-visit version of policy iteration with Monte Carlo policy evaluation converges to the optimal policy, provided that the policy improvement step uses lookahead.
We also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.
- Score: 11.345796608258434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common technique in reinforcement learning is to evaluate the value
function from Monte Carlo simulations of a given policy, and use the estimated
value function to obtain a new policy which is greedy with respect to the
estimated value function. A well-known longstanding open problem in this
context is to prove the convergence of such a scheme when the value function of
a policy is estimated from data collected from a single sample path obtained
from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8
of [Tsitsiklis, 2002]). We present a solution to the open problem by showing
that a first-visit version of such a policy iteration scheme indeed converges
to the optimal policy provided that the policy improvement step uses lookahead
[Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a
simple greedy policy improvement. We provide results both for the original open
problem in the tabular setting and also present extensions to the function
approximation setting, where we show that the policy resulting from the
algorithm performs close to the optimal policy within a function approximation
error.
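The scheme described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact algorithm: it assumes a known tabular model (`P`, `R`) for the lookahead step, truncates each sample path at a finite `length`, and draws a random start state per iteration. The toy three-state chain and all function names (`chain_mdp`, `lookahead_policy`, `first_visit_mc`, `policy_iteration_mc`) are hypothetical and chosen only for illustration.

```python
import random


def bellman_backup(P, R, V, gamma):
    """Return Q[s][a] = R[s][a] + gamma * sum_t P[s][a][t] * V[t]."""
    n = len(V)
    return [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
             for a in range(len(R[s]))] for s in range(n)]


def lookahead_policy(P, R, V, gamma, H):
    """Greedy policy after H-step lookahead (H = 1 is plain greedy improvement)."""
    W = list(V)
    for _ in range(H - 1):
        # Apply the Bellman optimality operator H - 1 times to the estimate.
        W = [max(q) for q in bellman_backup(P, R, W, gamma)]
    Q = bellman_backup(P, R, W, gamma)
    # Ties are broken in favour of the lowest action index.
    return [max(range(len(q)), key=q.__getitem__) for q in Q]


def first_visit_mc(P, R, policy, V, gamma, rng, length):
    """Update V only at states visited on ONE sample path of `policy`."""
    n = len(V)
    s = rng.randrange(n)  # assumption: random start state
    path = []
    for _ in range(length):
        a = policy[s]
        path.append((s, R[s][a]))
        s = rng.choices(range(n), weights=P[s][a])[0]
    V_new, G = list(V), 0.0  # assumption: return truncated at `length` steps
    for s, r in reversed(path):
        G = r + gamma * G
        # Walking backwards, the final overwrite for a state is its first visit.
        V_new[s] = G
    return V_new


def policy_iteration_mc(P, R, gamma=0.9, H=3, iters=20, length=50, seed=0):
    """Alternate lookahead improvement and single-path first-visit evaluation."""
    rng = random.Random(seed)
    V = [0.0] * len(R)
    policy = lookahead_policy(P, R, V, gamma, H)
    for _ in range(iters):
        V = first_visit_mc(P, R, policy, V, gamma, rng, length)
        policy = lookahead_policy(P, R, V, gamma, H)
    return policy, V


def chain_mdp():
    """Toy deterministic chain: action 0 moves left, action 1 moves right;
    reward 1 for taking 'right' in the last state."""
    n = 3
    P = [[[0.0] * n for _ in range(2)] for _ in range(n)]
    R = [[0.0, 0.0] for _ in range(n)]
    for s in range(n):
        P[s][0][max(s - 1, 0)] = 1.0
        P[s][1][min(s + 1, n - 1)] = 1.0
    R[n - 1][1] = 1.0
    return P, R


if __name__ == "__main__":
    P, R = chain_mdp()
    print(policy_iteration_mc(P, R))
```

On this toy chain, starting from a zero value estimate, the three-step lookahead already selects "right" in every state, whereas the two-step lookahead leaves state 0 tied and resolves the tie to "left". This is only meant to show how the lookahead depth `H` enters the improvement step, not to reproduce the paper's analysis.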
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation [8.465228064780748]
Off-policy sampling and linear function approximation are employed for policy evaluation.
Various policy update rules, including natural policy gradient (NPG), are considered for policy update.
We establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy.
arXiv Detail & Related papers (2022-08-05T15:59:05Z)
- Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
First, we show that the preferability of optimization methods depends critically on whether stochastic or exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
- On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
arXiv Detail & Related papers (2021-02-17T07:06:19Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Variational Policy Gradient Method for Reinforcement Learning with General Utilities [38.54243339632217]
In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction.
In this paper, we consider policy optimization in Markov decision problems, where the objective is a general concave utility function.
We derive a new Variational Policy Gradient Theorem for RL with general utilities.
arXiv Detail & Related papers (2020-07-04T17:51:53Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)