Proximal Policy Optimization with Continuous Bounded Action Space via
the Beta Distribution
- URL: http://arxiv.org/abs/2111.02202v1
- Date: Wed, 3 Nov 2021 13:13:00 GMT
- Title: Proximal Policy Optimization with Continuous Bounded Action Space via
the Beta Distribution
- Authors: Irving G. B. Petrazzini and Eric A. Antonelo
- Abstract summary: In this work, we investigate how this Beta policy performs when it is trained by the Proximal Policy Optimization algorithm on two continuous control tasks from OpenAI gym.
For both tasks, the Beta policy is superior to the Gaussian policy in terms of the agent's final expected reward, while also showing greater stability and faster convergence of the training process.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning methods for continuous control tasks have evolved in
recent years generating a family of policy gradient methods that rely primarily
on a Gaussian distribution for modeling a stochastic policy. However, the
Gaussian distribution has an infinite support, whereas real world applications
usually have a bounded action space. This dissonance causes an estimation bias
that can be eliminated if the Beta distribution is used for the policy instead,
as it presents a finite support. In this work, we investigate how this Beta
policy performs when it is trained by the Proximal Policy Optimization (PPO)
algorithm on two continuous control tasks from OpenAI gym. For both tasks, the
Beta policy is superior to the Gaussian policy in terms of the agent's final
expected reward, while also showing greater stability and faster convergence of the
training process. For the CarRacing environment with high-dimensional image
input, the agent's success rate was improved by 63% over the Gaussian policy.
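To make the bounded-action idea concrete, here is a minimal PyTorch sketch of a Beta policy head for PPO. It is an illustration under stated assumptions, not the authors' implementation: the softplus(+1) parameterization (keeping both concentrations above 1 for a unimodal Beta), the hidden size, and the rescaling from (0, 1) to [low, high] are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta


class BetaPolicy(nn.Module):
    """Illustrative Beta policy head for a bounded continuous action space."""

    def __init__(self, obs_dim, act_dim, low, high, hidden=64):
        super().__init__()
        self.low, self.high = low, high
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        # Two heads output the Beta concentration parameters (alpha, beta).
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        # softplus(.) + 1 keeps alpha, beta > 1, so the Beta is unimodal.
        alpha = F.softplus(self.alpha_head(h)) + 1.0
        beta = F.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

    def act(self, obs):
        dist = self.forward(obs)
        x = dist.sample()                    # x lies in (0, 1): finite support
        log_prob = dist.log_prob(x).sum(-1)  # used in the PPO probability ratio
        action = self.low + (self.high - self.low) * x  # rescale to the bounds
        return action, log_prob
```

The PPO clipped surrogate itself is unchanged; only the distribution used for the log-probability ratio and the entropy bonus is swapped from a Gaussian to this Beta head, whose finite support matches the rescaled action bounds instead of placing probability mass outside them.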
Related papers
- Soft Actor-Critic with Beta Policy via Implicit Reparameterization Gradients [0.0]
Soft actor-critic (SAC) mitigates poor sample efficiency by combining policy optimization and off-policy learning.
However, SAC is limited to distributions whose gradients can be computed through the reparameterization trick.
We extend this technique to train SAC with the beta policy on simulated robot locomotion environments.
Experimental results show that the beta policy is a viable alternative, matching or outperforming the normal policy (a minimal sketch of reparameterized Beta sampling is given after this list).
arXiv Detail & Related papers (2024-09-08T04:30:51Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - Bingham Policy Parameterization for 3D Rotations in Reinforcement
Learning [95.00518278458908]
We propose a new policy parameterization for representing 3D rotations during reinforcement learning.
Our proposed Bingham policy parameterization (BPP) models the Bingham distribution and allows for better rotation prediction.
We evaluate BPP on the rotation Wahba problem task, as well as a set of vision-based next-best pose robot manipulation tasks from RLBench.
arXiv Detail & Related papers (2022-02-08T16:09:02Z) - On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z) - Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor
Critic under State Distribution Mismatch [29.02336004872336]
We establish the global optimality and convergence rate of an off-policy actor critic algorithm.
Our work goes beyond existing works on the optimality of policy gradient methods.
arXiv Detail & Related papers (2021-11-04T16:48:45Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased gradients of the value function, which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose the implicit distributional actor-critic (IDAC), built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe that IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
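Tying back to the Soft Actor-Critic with Beta Policy entry above: the practical point is that the sampling path must be differentiable in the policy parameters. Below is a minimal PyTorch sketch, with all names and values illustrative rather than taken from that paper; it relies only on the fact that torch.distributions.Beta exposes pathwise gradients through rsample().

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

# Illustrative raw outputs of a policy head (placeholders, not the paper's code).
raw_a = torch.tensor([0.3, 1.2], requires_grad=True)
raw_b = torch.tensor([0.7, -0.4], requires_grad=True)

# softplus(.) + 1 maps the raw outputs to concentrations > 1 (unimodal Beta).
dist = Beta(F.softplus(raw_a) + 1.0, F.softplus(raw_b) + 1.0)

# rsample() draws actions in (0, 1) with a reparameterized (pathwise) gradient,
# so an actor loss can backpropagate through the sampled action.
action01 = dist.rsample()
actor_loss = (action01 ** 2).mean()   # placeholder for an actor objective
actor_loss.backward()
print(raw_a.grad, raw_b.grad)         # non-None: gradients reach the policy parameters
```

With .sample() instead of .rsample() there would be no differentiable path from the parameters to the action, which is exactly the limitation that restricts such methods to reparameterizable distributions.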