The Definitive Guide to Policy Gradients in Deep Reinforcement Learning:
Theory, Algorithms and Implementations
- URL: http://arxiv.org/abs/2401.13662v2
- Date: Fri, 1 Mar 2024 08:58:05 GMT
- Title: The Definitive Guide to Policy Gradients in Deep Reinforcement Learning:
Theory, Algorithms and Implementations
- Authors: Matthias Lehmann
- Abstract summary: In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning.
We provide a holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, various powerful policy gradient algorithms have been
proposed in deep reinforcement learning. While all these algorithms build on
the Policy Gradient Theorem, the specific design choices differ significantly
across algorithms. We provide a holistic overview of on-policy policy gradient
algorithms to facilitate the understanding of both their theoretical
foundations and their practical implementations. In this overview, we include a
detailed proof of the continuous version of the Policy Gradient Theorem,
convergence results and a comprehensive discussion of practical algorithms. We
compare the most prominent algorithms on continuous control environments and
provide insights on the benefits of regularization. All code is available at
https://github.com/Matt00n/PolicyGradientsJax.
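
For orientation (not material from the paper itself), the Policy Gradient Theorem referenced above states, in its standard form,

$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$,

which justifies the Monte Carlo score-function estimator used by REINFORCE-style algorithms. Below is a minimal JAX sketch of that estimator with an entropy bonus of the kind discussed under regularization; the linear policy, shapes, and hyperparameters are assumptions for illustration, not code from the PolicyGradientsJax repository.

```python
import jax
import jax.numpy as jnp

def policy_logits(params, obs):
    """Tiny linear policy mapping observations to action logits (illustrative)."""
    return obs @ params["w"] + params["b"]

def pg_loss(params, obs, actions, returns, entropy_coef=0.01):
    """Negative REINFORCE objective with an entropy bonus.

    Minimizing this ascends E[log pi(a|s) * G]; its gradient is the
    score-function estimator given by the Policy Gradient Theorem.
    """
    log_probs = jax.nn.log_softmax(policy_logits(params, obs))
    # log pi(a_t | s_t) for the actions actually taken
    taken = jnp.take_along_axis(log_probs, actions[:, None], axis=1)[:, 0]
    # entropy regularization term encourages exploration
    entropy = -jnp.sum(jnp.exp(log_probs) * log_probs, axis=1)
    return -jnp.mean(taken * returns + entropy_coef * entropy)

# One gradient step on a dummy batch (all shapes are assumptions).
key = jax.random.PRNGKey(0)
params = {"w": 0.1 * jax.random.normal(key, (4, 2)), "b": jnp.zeros(2)}
obs = jax.random.normal(key, (8, 4))        # batch of observations
actions = jnp.zeros(8, dtype=jnp.int32)     # sampled actions
returns = jnp.ones(8)                       # Monte Carlo returns G_t
grads = jax.grad(pg_loss)(params, obs, actions, returns)
params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
```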
Related papers
- Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning [0.46040036610482665]
Cumulative Prospect Theory (CPT) was developed to provide a better model for human decision-making, supported by empirical evidence.
A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem.
We show that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth order algorithm for solving the same problem.
arXiv Detail & Related papers (2024-10-03T15:45:39Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in a continuation of the return of the policy at hand, and that policies should be history-dependent rather than trained solely to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z)
- A policy gradient approach for Finite Horizon Constrained Markov Decision Processes [6.682382456607199]
We present an algorithm for constrained RL in the finite horizon setting, where the horizon terminates after a fixed (finite) time.
To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints (a generic Lagrangian-relaxation sketch follows this entry).
arXiv Detail & Related papers (2022-10-10T09:52:02Z)
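
Constrained policy-gradient settings like the one above are commonly attacked with a Lagrangian relaxation. The following is a generic JAX sketch of that technique (an assumption of this illustration, not the paper's specific algorithm): the policy is trained on reward minus a multiplier times cost, and the multiplier is adapted by dual ascent.

```python
import jax.numpy as jnp

def constrained_pg_loss(log_probs, returns, cost_returns, lam):
    """Lagrangian-relaxed policy gradient surrogate (negated for minimization):
    ascend the reward return while penalizing the cost return by lam."""
    return -jnp.mean(log_probs * (returns - lam * cost_returns))

def lambda_update(lam, avg_cost, cost_limit, lr=0.05):
    """Dual ascent on the multiplier: grow lam while the average episode
    cost exceeds the limit; project back to lam >= 0."""
    return jnp.maximum(0.0, lam + lr * (avg_cost - cost_limit))
```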
- Continuous MDP Homomorphisms and Homomorphic Policy Gradient [51.25171126424949]
We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces.
We propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously.
arXiv Detail & Related papers (2022-09-15T15:26:49Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multi-objective navigation problem with an arbitrary ordering of objectives, both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
- On the Linear convergence of Natural Policy Gradient Algorithm [5.027714423258537]
Recent interest in Reinforcement Learning has motivated the study of methods inspired by optimization.
Among these is the Natural Policy Gradient, which is a mirror descent variant for MDPs.
We present improved finite-time convergence bounds and show that this algorithm has a geometric convergence rate (the standard update is recalled below).
arXiv Detail & Related papers (2021-05-04T11:26:12Z)
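
For reference, the natural policy gradient update the entry above refers to is usually written as vanilla gradient ascent preconditioned by the Fisher information matrix (standard textbook form, not reproduced from the paper):

```latex
\theta_{k+1} = \theta_k + \eta\, F(\theta_k)^{-1} \nabla_\theta J(\theta_k),
\qquad
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a \mid s)\,
  \nabla_\theta \log \pi_\theta(a \mid s)^{\top}
\right]
```

The mirror descent view arises because, for softmax policies, this update can be interpreted as a KL-regularized proximal step.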
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations": algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm.
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function (a sketch of PPO's clipped objective follows this entry).
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
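
Since the entry above turns on PPO's implementation details, here is a minimal JAX sketch of PPO's core clipped surrogate objective, for contrast with the code-level additions the paper studies; function and argument names are assumptions of the sketch.

```python
import jax.numpy as jnp

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (negated for minimization).

    ratio = pi_new(a|s) / pi_old(a|s); clipping to [1 - eps, 1 + eps]
    removes the incentive to push the ratio far from 1 in one update.
    """
    ratio = jnp.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = jnp.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -jnp.mean(jnp.minimum(unclipped, clipped))
```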
- Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization [71.03797261151605]
Adaptivity is an important yet under-studied property in modern optimization theory.
Our algorithm provably achieves the best-available convergence rate for objectives without the Polyak-Lojasiewicz (PL) condition while simultaneously outperforming existing algorithms on PL objectives.
arXiv Detail & Related papers (2020-02-13T05:42:27Z)
- Population-Guided Parallel Policy Search for Reinforcement Learning [17.360163137926]
A new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL).
In the proposed scheme, multiple identical learners with their own value functions and policies share a common experience replay buffer and search for a good policy in collaboration, guided by the best policy's information.
arXiv Detail & Related papers (2020-01-09T10:13:57Z)