Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor
Critic under State Distribution Mismatch
- URL: http://arxiv.org/abs/2111.02997v1
- Date: Thu, 4 Nov 2021 16:48:45 GMT
- Title: Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor
Critic under State Distribution Mismatch
- Authors: Shangtong Zhang, Remi Tachet, Romain Laroche
- Abstract summary: We establish the global optimality and convergence rate of an off-policy actor critic algorithm.
Our work goes beyond existing works on the optimality of policy gradient methods.
- Score: 29.02336004872336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we establish the global optimality and convergence rate of an
off-policy actor critic algorithm in the tabular setting without using a density
ratio to correct the discrepancy between the state distribution of the behavior
policy and that of the target policy. Our work goes beyond existing works on
the optimality of policy gradient methods in that existing works use the exact
policy gradient for updating the policy parameters while we use an approximate
and stochastic update step. Our update step is not a gradient update because we
do not use a density ratio to correct the state distribution, which aligns well
with what practitioners do. Our update is approximate because we use a learned
critic instead of the true value function. Our update is stochastic because at
each step the update is done for only the current state-action pair. Moreover,
we remove several restrictive assumptions from existing works in our analysis.
Central to our work is the finite sample analysis of a generic stochastic
approximation algorithm with time-inhomogeneous update operators on
time-inhomogeneous Markov chains, based on its uniform contraction properties.
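Below is a minimal sketch of what such an update can look like in code, assuming a tabular environment with a reset()/step(a) interface, a fixed uniform behavior policy, and an expected-SARSA-style critic; the step sizes, sampling scheme, and environment interface are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch of a tabular softmax off-policy actor-critic update that
# does NOT apply a density-ratio correction to the state distribution.
# Environment interface, behavior policy, and step sizes are assumptions.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def off_policy_actor_critic(env, num_states, num_actions, gamma=0.99,
                            alpha_critic=0.1, alpha_actor=0.01,
                            num_steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((num_states, num_actions))  # softmax policy logits (actor)
    q = np.zeros((num_states, num_actions))      # learned action values (critic)
    behavior = np.full((num_states, num_actions), 1.0 / num_actions)

    s = env.reset()
    for _ in range(num_steps):
        a = rng.choice(num_actions, p=behavior[s])   # act with the behavior policy
        s_next, r, done = env.step(a)

        # Critic: off-policy TD(0) with an expected target under the current target policy.
        pi_next = softmax(theta[s_next])
        target = r + (0.0 if done else gamma * pi_next @ q[s_next])
        q[s, a] += alpha_critic * (target - q[s, a])

        # Actor: stochastic update at the current (s, a) only, using the learned
        # critic in place of the true value; the state s is weighted by the
        # behavior policy's state distribution, with no density-ratio correction.
        pi = softmax(theta[s])
        grad_log = -pi
        grad_log[a] += 1.0                           # d/dtheta[s] of log pi(a | s)
        theta[s] += alpha_actor * grad_log * q[s, a]

        s = env.reset() if done else s_next
    return theta, q
```

The contrast with exact policy-gradient analyses is visible here: the state is drawn from the behavior policy's own trajectory, the critic is learned rather than exact, and each step updates only a single state-action pair.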
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance (a generic importance-weighted gradient estimator is sketched below).
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
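For reference, a minimal sketch of the standard importance-sampling-weighted policy gradient estimator (a generic REINFORCE-style form, not necessarily the estimator studied in that paper), with trajectories collected under a behavior policy $b$:

```latex
% Generic importance-weighted policy gradient estimator (illustrative only):
% trajectories are collected under a behavior policy b and re-weighted
% toward the target policy pi_theta.
\[
\widehat{\nabla_\theta J}(\pi_\theta)
  = \frac{1}{N}\sum_{i=1}^{N}
    \Bigg(\prod_{t=0}^{T-1}\frac{\pi_\theta(a^i_t \mid s^i_t)}{b(a^i_t \mid s^i_t)}\Bigg)
    \Bigg(\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a^i_t \mid s^i_t)\Bigg)
    \Bigg(\sum_{t=0}^{T-1}\gamma^{t} r^i_t\Bigg).
\]
```

The variance of this estimator is driven by the importance weights, which is what motivates actively choosing the behavior policy $b$ rather than treating IS as a purely passive correction.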
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned during training, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance (a baseline-subtracted gradient is sketched below).
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
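For context, a minimal sketch of the standard policy gradient with a state-value baseline (a generic form, not that paper's specific analysis): subtracting $V^{\pi_\theta}(s)$ does not change the expectation because $\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}[\nabla_\theta\log\pi_\theta(a\mid s)]=0$ for every state.

```latex
% Standard baseline-subtracted policy gradient (illustrative): the state-value
% baseline V^{pi}(s) shifts Q^{pi}(s,a) to the advantage without biasing the gradient.
\[
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}
    \Big[\nabla_\theta \log \pi_\theta(a \mid s)\,
         \big(Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)\big)\Big].
\]
```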
- Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms [10.356356383401566]
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states.
We discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target.
We introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions.
arXiv Detail & Related papers (2022-02-15T15:04:10Z)
- On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches (a basic kernel-smoothed estimator is sketched below).
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
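As a rough illustration (an assumption on my part, not necessarily one of the estimators proposed in that paper), a kernel-smoothed importance-sampling value estimate for a deterministic target policy $\pi$ with continuous actions replaces the nonexistent density ratio with a kernel $K_h$ of bandwidth $h$:

```latex
% Kernel-smoothed off-policy value estimate for a deterministic target policy
% (illustrative sketch): the event a_i = pi(s_i) has probability zero under a
% continuous behavior density b, so the indicator is relaxed to a kernel K_h.
\[
\widehat{V}(\pi)
  = \frac{1}{n}\sum_{i=1}^{n}
    \frac{K_h\!\big(a_i - \pi(s_i)\big)}{b(a_i \mid s_i)}\, r_i,
\qquad
K_h(u) = \frac{1}{h}\,K\!\Big(\frac{u}{h}\Big).
\]
```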
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)