Regularly Updated Deterministic Policy Gradient Algorithm
- URL: http://arxiv.org/abs/2007.00169v1
- Date: Wed, 1 Jul 2020 01:18:25 GMT
- Title: Regularly Updated Deterministic Policy Gradient Algorithm
- Authors: Shuai Han and Wenbo Zhou and Shuai Lü and Jiayu Yu
- Abstract summary: This paper proposes a Regularly Updated Deterministic (RUD) policy gradient algorithm to address the inefficiency, instability, and Q-estimation issues of DDPG.
It theoretically proves that the learning procedure with RUD makes better use of new data in the replay buffer than the traditional procedure.
- Score: 11.57539530904012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Deep Deterministic Policy Gradient (DDPG) algorithm is one of the most
well-known reinforcement learning methods. However, it is inefficient and unstable in
practical applications. Moreover, the bias and variance of the Q estimate in the target
function are sometimes difficult to control. This paper proposes a Regularly Updated
Deterministic (RUD) policy gradient algorithm to address these problems. The paper
theoretically proves that the learning procedure with RUD makes better use of new data
in the replay buffer than the traditional procedure. In addition, the low variance of
the Q value in RUD is better suited to the Clipped Double Q-learning strategy. The paper
presents a comparison experiment against previous methods, an ablation experiment
against the original DDPG, and further analytical experiments in MuJoCo environments.
The experimental results demonstrate the effectiveness and superiority of RUD.
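To make the two ingredients above concrete, here is a minimal, hypothetical sketch of a DDPG-style learner that uses Clipped Double Q-learning targets together with a "collect for a while, then update in a burst" schedule in the spirit of regular updates. The schedule constants (`collect_period`, `updates_per_period`), the network sizes, and the random toy transitions are illustrative assumptions, not the paper's actual RUD procedure or hyper-parameters.

```python
# Hedged sketch (not the paper's code): DDPG-style updates with Clipped Double
# Q-learning targets and a periodic "collect, then update in a burst" schedule.
import copy
import random
from collections import deque

import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

class TwinCritic(nn.Module):
    """Two Q-networks, as required by Clipped Double Q-learning."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q1 = mlp(obs_dim + act_dim, 1)
        self.q2 = mlp(obs_dim + act_dim, 1)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.q1(x), self.q2(x)

obs_dim, act_dim, gamma, tau = 17, 6, 0.99, 0.005
actor, critic = mlp(obs_dim, act_dim), TwinCritic(obs_dim, act_dim)
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer = deque(maxlen=1_000_000)

def update(batch_size=256):
    obs, act, rew, nxt, done = (torch.stack(x) for x in zip(*random.sample(buffer, batch_size)))
    with torch.no_grad():
        # Clipped Double Q target: smoothed target action, then min of the twin target critics.
        noise = (0.2 * torch.randn_like(act)).clamp(-0.5, 0.5)
        nxt_act = (torch.tanh(actor_targ(nxt)) + noise).clamp(-1.0, 1.0)
        q1_t, q2_t = critic_targ(nxt, nxt_act)
        target = rew + gamma * (1.0 - done) * torch.min(q1_t, q2_t).squeeze(-1)
    q1, q2 = critic(obs, act)
    critic_loss = ((q1.squeeze(-1) - target) ** 2 + (q2.squeeze(-1) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient through the first critic.
    actor_loss = -critic(obs, torch.tanh(actor(obs)))[0].mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak-averaged target networks.
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

# Illustrative "regular update" schedule: alternate a collection phase with a burst
# of updates, so freshly collected transitions are in the buffer before sampling.
collect_period, updates_per_period = 50, 50
for step in range(1_000):  # stands in for MuJoCo environment steps
    buffer.append((torch.randn(obs_dim), torch.rand(act_dim) * 2 - 1,
                   torch.randn(()), torch.randn(obs_dim), torch.zeros(())))
    if (step + 1) % collect_period == 0 and len(buffer) >= 256:
        for _ in range(updates_per_period):
            update()
```

The sketch only shows where such a schedule would slot into a standard DDPG/TD3-style loop; the paper's ablation compares the actual RUD procedure against the original one-update-per-step DDPG.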
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance. (A minimal sketch of the underlying importance-weighted estimator appears after this related-papers list.)
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Coordinate-wise Control Variates for Deep Policy Gradients [23.24910014825916]
The effect of vector-valued baselines for neural net policies is under-explored.
We show that lower variance can be obtained with such baselines than with the conventional scalar-valued baseline.
arXiv Detail & Related papers (2021-07-11T07:36:01Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order algorithm (ZO-RL) that learns the sampling policy used to generate the perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered Weighted $L_1$ (OWL) regularized regression is a new regression analysis for high-dimensional sparse learning.
Proximal gradient methods are used as standard approaches to solve OWL regression.
We propose the first safe screening rule for OWL regression by exploring the order of the primal solution with the unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z)
- Stochastic Recursive Momentum for Policy Gradient Methods [28.277961340108313]
We propose a novel algorithm named STOchastic Recursive Momentum for Policy Gradient (STORM-PG).
STORM-PG enjoys a provably sharp $O(1/\epsilon^3)$ sample complexity bound, matching the best-known convergence rate for policy gradient algorithms.
Numerical experiments demonstrate the superiority of our algorithm over comparable policy gradient algorithms.
arXiv Detail & Related papers (2020-03-09T17:59:03Z)
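Referring back to the "Policy Gradient with Active Importance Sampling" entry above, the sketch below shows the basic importance-weighted REINFORCE estimator that such methods build on, for a one-dimensional Gaussian policy. It only illustrates the passive re-weighting of samples collected under a behavioral policy; the active choice of that behavioral policy, which is that paper's contribution, is not reproduced, and all names and constants are illustrative.

```python
# Hedged sketch: importance-weighted policy gradient for a 1-D Gaussian policy,
# reusing actions sampled under a behavioral mean `behav_mean`. Illustrative only.
import numpy as np

def gaussian_logpdf(a, mean, std):
    return -0.5 * np.log(2.0 * np.pi * std**2) - (a - mean) ** 2 / (2.0 * std**2)

def is_policy_gradient(actions, returns, target_mean, behav_mean, std=1.0):
    """Estimate dJ/d(target_mean) from samples drawn under behav_mean."""
    # Per-sample importance weight pi_theta(a) / beta(a).
    w = np.exp(gaussian_logpdf(actions, target_mean, std)
               - gaussian_logpdf(actions, behav_mean, std))
    # Score function of the Gaussian mean: d log pi / d mean = (a - mean) / std^2.
    score = (actions - target_mean) / std**2
    grad_samples = w * score * returns
    return grad_samples.mean(), grad_samples.var()

rng = np.random.default_rng(0)
behav_mean, target_mean = 0.5, 0.0
actions = rng.normal(behav_mean, 1.0, size=10_000)
returns = -(actions - 1.0) ** 2  # toy one-step reward
grad, var = is_policy_gradient(actions, returns, target_mean, behav_mean)
print(f"IS gradient estimate: {grad:.3f} (per-sample variance: {var:.1f})")
```

The variance of `grad_samples` is exactly the quantity that choosing a better behavioral policy aims to reduce.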