Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy
- URL: http://arxiv.org/abs/2008.00483v2
- Date: Sun, 13 Jun 2021 05:25:16 GMT
- Title: Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy
- Authors: Zuyue Fu, Zhuoran Yang, Zhaoran Wang
- Abstract summary: We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms.
We establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time.
- Score: 122.01837436087516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the global convergence and global optimality of actor-critic, one of
the most popular families of reinforcement learning algorithms. While most
existing works on actor-critic employ bi-level or two-timescale updates, we
focus on the more practical single-timescale setting, where the actor and
critic are updated simultaneously. Specifically, in each iteration, the critic
update is obtained by applying the Bellman evaluation operator only once while
the actor is updated in the policy gradient direction computed using the
critic. Moreover, we consider two function approximation settings where both
the actor and critic are represented by either linear functions or deep neural networks. For
both cases, we prove that the actor sequence converges to a globally optimal
policy at a sublinear $O(K^{-1/2})$ rate, where $K$ is the number of
iterations. To the best of our knowledge, we establish the rate of convergence
and global optimality of single-timescale actor-critic with linear function
approximation for the first time. Moreover, under the broader scope of policy
optimization with nonlinear function approximation, we prove that actor-critic
with deep neural networks finds the globally optimal policy at a sublinear rate
for the first time.
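For readers who want a concrete picture of the update rule described above, the following is a minimal sketch of a single-timescale actor-critic with linear function approximation: in every iteration the critic performs one sampled Bellman-evaluation (TD) step and the actor performs one policy-gradient step computed from the current critic, with both step sizes constant and of the same order. The toy environment, one-hot feature map, and step-size values are illustrative assumptions and do not reproduce the paper's exact algorithm or analysis setting.

```python
# Minimal sketch (not the paper's exact algorithm): single-timescale actor-critic
# with linear function approximation. Per iteration, the critic takes ONE sampled
# Bellman-evaluation (TD) step and the actor takes one policy-gradient step using
# the current critic; both step sizes are constant and of the same order.
import numpy as np

def softmax_policy(theta, phi_all):
    """Softmax policy over actions; phi_all has shape (num_actions, d)."""
    logits = phi_all @ theta
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def single_timescale_ac(env, phi, num_actions, d, K=5000,
                        alpha=0.05, beta=0.05, gamma=0.99):
    theta = np.zeros(d)                         # actor (policy) parameters
    w = np.zeros(d)                             # critic (Q-function) parameters
    s = env.reset()
    for _ in range(K):
        phi_all = np.stack([phi(s, a) for a in range(num_actions)])
        probs = softmax_policy(theta, phi_all)
        a = np.random.choice(num_actions, p=probs)
        s_next, r, done = env.step(a)

        # Critic: one TD(0) step, i.e. one application of the sampled
        # Bellman evaluation operator under the current policy.
        phi_next = np.stack([phi(s_next, b) for b in range(num_actions)])
        v_next = softmax_policy(theta, phi_next) @ (phi_next @ w)
        td_error = r + gamma * (0.0 if done else v_next) - phi(s, a) @ w
        w = w + beta * td_error * phi(s, a)

        # Actor: one policy-gradient step computed from the current critic.
        grad_log_pi = phi(s, a) - probs @ phi_all   # softmax-linear score function
        theta = theta + alpha * (phi(s, a) @ w) * grad_log_pi

        s = env.reset() if done else s_next
    return theta, w

class ToyChainEnv:
    """Tiny 2-state, 2-action MDP used only to exercise the sketch (assumed, not from the paper)."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        if self.s == 0 and a == 1:              # moving right from state 0 pays reward 1
            self.s, r = 1, 1.0
        else:
            self.s, r = 0, 0.0
        return self.s, r, False                 # continuing (non-episodic) task

def one_hot_features(s, a, d=4):
    """One-hot feature for each (state, action) pair of the toy MDP."""
    x = np.zeros(d)
    x[2 * s + a] = 1.0
    return x

theta, w = single_timescale_ac(ToyChainEnv(), one_hot_features, num_actions=2, d=4)
```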
Related papers
- Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation [5.945710235932345]
We present the first two-timescale critic-actor algorithm with function approximation in the long-run average reward setting.
A notable feature of our analysis is that, unlike recent work on single-timescale actor-critic algorithms, we present a complete convergence analysis of our scheme.
arXiv Detail & Related papers (2024-02-02T12:48:49Z)
- Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees [12.259191000019033]
Actor-critic (AC) methods are widely used in reinforcement learning (RL).
We design a joint objective for training the actor and critic in a decision-aware fashion.
We empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
arXiv Detail & Related papers (2023-05-24T15:34:21Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic [137.04558017227583]
Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years.
We take a mean-field perspective on the evolution and convergence of feature-based neural AC.
We prove that neural AC finds the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2021-12-27T06:09:50Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single-timescale structure, in which the actor and critic are updated simultaneously with constant step sizes.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER [6.9478331974594045]
We show convergence of the well-known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER.
Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning.
arXiv Detail & Related papers (2020-12-02T18:47:06Z)
- Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms [58.57004511121862]
Actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies.
We show that two time-scale AC requires an overall sample complexity of order $\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point (see the step-size sketch after this list).
We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling.
arXiv Detail & Related papers (2020-05-07T15:42:31Z)
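Several entries above differ mainly in how the actor and critic step sizes are scheduled relative to each other. As a purely illustrative contrast (the constants and decay exponents are assumptions, not the rates analyzed in any of the cited papers), a single-timescale scheme keeps both step sizes at the same order, whereas the common two-timescale arrangement lets the critic step size dominate the actor step size asymptotically; the critic-actor variant listed above reverses these roles.

```python
# Illustrative step-size schedules only; the exponents are textbook-style assumptions,
# not the rates used in the papers listed above.
def single_timescale_stepsizes(k):
    """Actor and critic step sizes of the same (constant) order at iteration k."""
    return 0.01, 0.01                            # (alpha_actor, beta_critic)

def two_timescale_stepsizes(k):
    """Critic on the faster timescale: beta_k / alpha_k -> infinity as k grows."""
    alpha_actor = 1.0 / (k + 1)                  # slower (actor) timescale
    beta_critic = 1.0 / (k + 1) ** 0.6           # faster (critic) timescale
    return alpha_actor, beta_critic
```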