Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER
- URL: http://arxiv.org/abs/2012.01399v1
- Date: Wed, 2 Dec 2020 18:47:06 GMT
- Title: Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER
- Authors: Markus Holzleitner, Lukas Gruber, José Arjona-Medina, Johannes Brandstetter, Sepp Hochreiter
- Abstract summary: We show convergence of the well known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER.
Our results are valid for actor-critic methods that use episodic samples and that have a policy that becomes more greedy during learning.
- Score: 6.9478331974594045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We prove under commonly used assumptions the convergence of actor-critic
reinforcement learning algorithms, which simultaneously learn a policy
function, the actor, and a value function, the critic. Both functions can be
deep neural networks of arbitrary complexity. Our framework allows showing
convergence of the well known Proximal Policy Optimization (PPO) and of the
recently introduced RUDDER. For the convergence proof we employ recently
introduced techniques from the two time-scale stochastic approximation theory.
Our results are valid for actor-critic methods that use episodic samples and
that have a policy that becomes more greedy during learning. Previous
convergence proofs assume linear function approximation, cannot treat episodic
examples, or do not consider that policies become greedy. The latter is
relevant since optimal policies are typically deterministic.
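The setting described above can be made concrete with a small sketch: an actor (softmax policy) and a critic (value function) are learned simultaneously from full episodes, the critic uses the faster timescale (a step size that decays more slowly) than the actor, and an annealed temperature makes the policy greedier over time. The toy chain environment, the schedules, and the step-size exponents below are illustrative assumptions, not the paper's construction and not PPO or RUDDER themselves.

```python
# Minimal two-timescale actor-critic sketch on a toy episodic chain MDP.
# Illustrative only: the environment, schedules, and exponents are assumptions,
# not the construction analyzed in the paper (and not PPO or RUDDER).
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.99

def step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1 at the right end."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

theta = np.zeros((N_STATES, N_ACTIONS))  # actor parameters (softmax logits)
v = np.zeros(N_STATES)                   # critic parameters (state values)

def policy(state, temperature):
    logits = theta[state] / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()

for episode in range(2000):
    # Two timescales: the critic's step size decays more slowly (fast timescale),
    # so the critic tracks the slowly changing actor, as in two time-scale
    # stochastic approximation.
    alpha_critic = 1.0 / (1 + episode) ** 0.6
    alpha_actor = 1.0 / (1 + episode) ** 0.9
    # Annealed softmax temperature: the policy becomes more greedy over time.
    temperature = max(0.05, 1.0 / (1 + 0.01 * episode))

    # Collect one full episode (episodic samples).
    s, trajectory, done = 0, [], False
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(s, temperature))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r, s_next, done))
        s = s_next

    # Update critic (TD(0)) and actor (policy gradient with TD error as advantage).
    for s, a, r, s_next, terminal in trajectory:
        td_error = r + (0.0 if terminal else GAMMA * v[s_next]) - v[s]
        v[s] += alpha_critic * td_error
        grad_log_pi = -policy(s, temperature)
        grad_log_pi[a] += 1.0
        theta[s] += alpha_actor * td_error * grad_log_pi / temperature
```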
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergent (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms [1.776746672434207]
We study policy gradient (PG) for reinforcement learning in continuous time and space.
We propose two types of actor-critic algorithms for RL, in which value functions and policies are learned and updated simultaneously and alternately.
arXiv Detail & Related papers (2021-11-22T14:27:04Z)
- Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z)
- Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms prior approaches inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z)
- Improved Regret Bound and Experience Replay in Regularized Policy Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions.
Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return (a generic sketch of such a mean-variance criterion appears after this list).
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy [122.01837436087516]
We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms.
We establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time.
arXiv Detail & Related papers (2020-08-02T14:01:49Z)
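The variance-penalized entry above optimizes a criterion that combines the mean and the variance of the return. As a generic illustration (not the cited paper's algorithm), the gradient of $J(\theta) = \mathbb{E}[G] - \lambda\,\mathrm{Var}[G]$ can be estimated from episodic returns with score-function estimators; the function name and array shapes below are assumptions.

```python
# Hypothetical sketch of a variance-penalized policy gradient for the objective
# J(theta) = E[G] - lam * Var[G], estimated from episodic returns with
# score-function (REINFORCE-style) estimators. Function name and shapes are
# assumptions, not the cited paper's algorithm.
import numpy as np

def variance_penalized_gradient(returns, score_grads, lam):
    """returns: episodic returns G_i, shape (n,);
    score_grads: per-episode sums of grad log pi, shape (n, d);
    lam: variance-penalty coefficient."""
    G = np.asarray(returns, dtype=float)
    S = np.asarray(score_grads, dtype=float)
    grad_mean = (G[:, None] * S).mean(axis=0)            # estimate of grad E[G]
    grad_second = ((G ** 2)[:, None] * S).mean(axis=0)   # estimate of grad E[G^2]
    grad_var = grad_second - 2.0 * G.mean() * grad_mean  # grad Var[G] = grad E[G^2] - grad (E[G])^2
    return grad_mean - lam * grad_var
```

In practice a baseline would typically be subtracted from the returns to reduce the variance of these estimators.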