The Definitive Guide to Policy Gradients in Deep Reinforcement Learning:
Theory, Algorithms and Implementations
- URL: http://arxiv.org/abs/2401.13662v2
- Date: Fri, 1 Mar 2024 08:58:05 GMT
- Title: The Definitive Guide to Policy Gradients in Deep Reinforcement Learning:
Theory, Algorithms and Implementations
- Authors: Matthias Lehmann
- Abstract summary: In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning.
We provide a holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, various powerful policy gradient algorithms have been
proposed in deep reinforcement learning. While all these algorithms build on
the Policy Gradient Theorem, the specific design choices differ significantly
across algorithms. We provide a holistic overview of on-policy policy gradient
algorithms to facilitate the understanding of both their theoretical
foundations and their practical implementations. In this overview, we include a
detailed proof of the continuous version of the Policy Gradient Theorem,
convergence results and a comprehensive discussion of practical algorithms. We
compare the most prominent algorithms on continuous control environments and
provide insights on the benefits of regularization. All code is available at
https://github.com/Matt00n/PolicyGradientsJax.
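
For orientation (not material from the paper itself), the Policy Gradient Theorem referenced above states, in its standard form,

$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$,

which justifies the Monte Carlo score-function estimator used by REINFORCE-style algorithms. Below is a minimal JAX sketch of that estimator with an entropy bonus of the kind discussed under regularization; the linear policy, shapes, and hyperparameters are assumptions for illustration, not code from the PolicyGradientsJax repository.

```python
import jax
import jax.numpy as jnp

def policy_logits(params, obs):
    """Tiny linear policy mapping observations to action logits (illustrative)."""
    return obs @ params["w"] + params["b"]

def pg_loss(params, obs, actions, returns, entropy_coef=0.01):
    """Negative REINFORCE objective with an entropy bonus.

    Minimizing this ascends E[log pi(a|s) * G]; its gradient is the
    score-function estimator given by the Policy Gradient Theorem.
    """
    log_probs = jax.nn.log_softmax(policy_logits(params, obs))
    # log pi(a_t | s_t) for the actions actually taken
    taken = jnp.take_along_axis(log_probs, actions[:, None], axis=1)[:, 0]
    # entropy regularization term encourages exploration
    entropy = -jnp.sum(jnp.exp(log_probs) * log_probs, axis=1)
    return -jnp.mean(taken * returns + entropy_coef * entropy)

# One gradient step on a dummy batch (all shapes are assumptions).
key = jax.random.PRNGKey(0)
params = {"w": 0.1 * jax.random.normal(key, (4, 2)), "b": jnp.zeros(2)}
obs = jax.random.normal(key, (8, 4))        # batch of observations
actions = jnp.zeros(8, dtype=jnp.int32)     # sampled actions
returns = jnp.ones(8)                       # Monte Carlo returns G_t
grads = jax.grad(pg_loss)(params, obs, actions, returns)
params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
```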
Related papers
- Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning [0.46040036610482665]
Cumulative Prospect Theory (CPT) was developed to provide a better model for human decision-making, supported by empirical evidence.
A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem.
We show that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth order algorithm for solving the same problem.
arXiv Detail & Related papers (2024-10-03T15:45:39Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in a continuation of the return of the policy at hand, and that policies should be history-dependent rather than trained solely to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z)
- A policy gradient approach for Finite Horizon Constrained Markov Decision Processes [6.682382456607199]
We present an algorithm for constrained RL in the finite horizon setting, where the horizon terminates after a fixed (finite) time.
To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints (a generic Lagrangian-relaxation sketch follows this entry).
arXiv Detail & Related papers (2022-10-10T09:52:02Z)
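
Constrained policy-gradient settings like the one above are commonly attacked with a Lagrangian relaxation. The following is a generic JAX sketch of that technique (an assumption of this illustration, not the paper's specific algorithm): the policy is trained on reward minus a multiplier times cost, and the multiplier is adapted by dual ascent.

```python
import jax.numpy as jnp

def constrained_pg_loss(log_probs, returns, cost_returns, lam):
    """Lagrangian-relaxed policy gradient surrogate (negated for minimization):
    ascend the reward return while penalizing the cost return by lam."""
    return -jnp.mean(log_probs * (returns - lam * cost_returns))

def lambda_update(lam, avg_cost, cost_limit, lr=0.05):
    """Dual ascent on the multiplier: grow lam while the average episode
    cost exceeds the limit; project back to lam >= 0."""
    return jnp.maximum(0.0, lam + lr * (avg_cost - cost_limit))
```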
- Continuous MDP Homomorphisms and Homomorphic Policy Gradient [51.25171126424949]
We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces.
We propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously.
arXiv Detail & Related papers (2022-09-15T15:26:49Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multi-objective navigation problem with an arbitrary ordering of objectives, both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
- On the Linear convergence of Natural Policy Gradient Algorithm [5.027714423258537]
Recent interest in Reinforcement Learning has motivated the study of methods inspired by optimization.
Among these is the Natural Policy Gradient, which is a mirror descent variant for MDPs.
We present improved finite-time convergence bounds and show that this algorithm has a geometric convergence rate (the standard update is recalled below).
arXiv Detail & Related papers (2021-05-04T11:26:12Z)
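
For reference, the natural policy gradient update the entry above refers to is usually written as vanilla gradient ascent preconditioned by the Fisher information matrix (standard textbook form, not reproduced from the paper):

```latex
\theta_{k+1} = \theta_k + \eta\, F(\theta_k)^{-1} \nabla_\theta J(\theta_k),
\qquad
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a \mid s)\,
  \nabla_\theta \log \pi_\theta(a \mid s)^{\top}
\right]
```

The mirror descent view arises because, for softmax policies, this update can be interpreted as a KL-regularized proximal step.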
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations": algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm.
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function (a sketch of PPO's clipped objective follows this entry).
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
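
Since the entry above turns on PPO's implementation details, here is a minimal JAX sketch of PPO's core clipped surrogate objective, for contrast with the code-level additions the paper studies; function and argument names are assumptions of the sketch.

```python
import jax.numpy as jnp

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (negated for minimization).

    ratio = pi_new(a|s) / pi_old(a|s); clipping to [1 - eps, 1 + eps]
    removes the incentive to push the ratio far from 1 in one update.
    """
    ratio = jnp.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = jnp.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -jnp.mean(jnp.minimum(unclipped, clipped))
```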
- Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization [71.03797261151605]
Adaptivity is an important yet under-studied property in modern optimization theory.
Our algorithm provably achieves the best-available convergence rate for objectives without the Polyak-Lojasiewicz (PL) condition while simultaneously outperforming existing algorithms on PL objectives.
arXiv Detail & Related papers (2020-02-13T05:42:27Z)
- Population-Guided Parallel Policy Search for Reinforcement Learning [17.360163137926]
A new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL).
In the proposed scheme, multiple identical learners with their own value functions and policies share a common experience replay buffer and search for a good policy in collaboration, guided by the best policy's information.
arXiv Detail & Related papers (2020-01-09T10:13:57Z)