Regularly Updated Deterministic Policy Gradient Algorithm
- URL: http://arxiv.org/abs/2007.00169v1
- Date: Wed, 1 Jul 2020 01:18:25 GMT
- Title: Regularly Updated Deterministic Policy Gradient Algorithm
- Authors: Shuai Han and Wenbo Zhou and Shuai Lü and Jiayu Yu
- Abstract summary: This paper proposes a Regularly Updated Deterministic (RUD) policy gradient algorithm to address the inefficiency, instability, and Q-estimation issues of DDPG.
It theoretically proves that the learning procedure with RUD makes better use of new data in the replay buffer than the traditional procedure.
- Score: 11.57539530904012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Deep Deterministic Policy Gradient (DDPG) algorithm is one of the most
well-known reinforcement learning methods. However, it is inefficient and unstable in
practical applications. Moreover, the bias and variance of the Q estimate in the target
function are sometimes difficult to control. This paper proposes a Regularly Updated
Deterministic (RUD) policy gradient algorithm to address these problems. The paper
theoretically proves that the learning procedure with RUD makes better use of new data
in the replay buffer than the traditional procedure. In addition, the low variance of
the Q value in RUD is better suited to the Clipped Double Q-learning strategy. The paper
presents a comparison experiment against previous methods, an ablation experiment
against the original DDPG, and further analytical experiments in MuJoCo environments.
The experimental results demonstrate the effectiveness and superiority of RUD.
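To make the two ingredients above concrete, here is a minimal, hypothetical sketch of a DDPG-style learner that uses Clipped Double Q-learning targets together with a "collect for a while, then update in a burst" schedule in the spirit of regular updates. The schedule constants (`collect_period`, `updates_per_period`), the network sizes, and the random toy transitions are illustrative assumptions, not the paper's actual RUD procedure or hyper-parameters.

```python
# Hedged sketch (not the paper's code): DDPG-style updates with Clipped Double
# Q-learning targets and a periodic "collect, then update in a burst" schedule.
import copy
import random
from collections import deque

import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

class TwinCritic(nn.Module):
    """Two Q-networks, as required by Clipped Double Q-learning."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q1 = mlp(obs_dim + act_dim, 1)
        self.q2 = mlp(obs_dim + act_dim, 1)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.q1(x), self.q2(x)

obs_dim, act_dim, gamma, tau = 17, 6, 0.99, 0.005
actor, critic = mlp(obs_dim, act_dim), TwinCritic(obs_dim, act_dim)
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer = deque(maxlen=1_000_000)

def update(batch_size=256):
    obs, act, rew, nxt, done = (torch.stack(x) for x in zip(*random.sample(buffer, batch_size)))
    with torch.no_grad():
        # Clipped Double Q target: smoothed target action, then min of the twin target critics.
        noise = (0.2 * torch.randn_like(act)).clamp(-0.5, 0.5)
        nxt_act = (torch.tanh(actor_targ(nxt)) + noise).clamp(-1.0, 1.0)
        q1_t, q2_t = critic_targ(nxt, nxt_act)
        target = rew + gamma * (1.0 - done) * torch.min(q1_t, q2_t).squeeze(-1)
    q1, q2 = critic(obs, act)
    critic_loss = ((q1.squeeze(-1) - target) ** 2 + (q2.squeeze(-1) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient through the first critic.
    actor_loss = -critic(obs, torch.tanh(actor(obs)))[0].mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak-averaged target networks.
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

# Illustrative "regular update" schedule: alternate a collection phase with a burst
# of updates, so freshly collected transitions are in the buffer before sampling.
collect_period, updates_per_period = 50, 50
for step in range(1_000):  # stands in for MuJoCo environment steps
    buffer.append((torch.randn(obs_dim), torch.rand(act_dim) * 2 - 1,
                   torch.randn(()), torch.randn(obs_dim), torch.zeros(())))
    if (step + 1) % collect_period == 0 and len(buffer) >= 256:
        for _ in range(updates_per_period):
            update()
```

The sketch only shows where such a schedule would slot into a standard DDPG/TD3-style loop; the paper's ablation compares the actual RUD procedure against the original one-update-per-step DDPG.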
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance. (A minimal sketch of the underlying importance-weighted estimator appears after this related-papers list.)
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Coordinate-wise Control Variates for Deep Policy Gradients [23.24910014825916]
The effect of vector-valued baselines for neural net policies is under-explored.
We show that lower variance can be obtained with such baselines than with the conventional scalar-valued baseline.
arXiv Detail & Related papers (2021-07-11T07:36:01Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order algorithm (ZO-RL) that learns the sampling policy used to generate the perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered Weighted $L_1$ (OWL) regularized regression is a new regression analysis for high-dimensional sparse learning.
Proximal gradient methods are used as standard approaches to solve OWL regression.
We propose the first safe screening rule for OWL regression by exploring the order of the primal solution with the unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z)
- Stochastic Recursive Momentum for Policy Gradient Methods [28.277961340108313]
We propose a novel algorithm named STOchastic Recursive Momentum for Policy Gradient (STORM-PG).
STORM-PG enjoys a provably sharp $O(1/\epsilon^3)$ sample complexity bound, matching the best-known convergence rate for policy gradient algorithms.
Numerical experiments demonstrate the superiority of our algorithm over comparable policy gradient algorithms.
arXiv Detail & Related papers (2020-03-09T17:59:03Z)
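Referring back to the "Policy Gradient with Active Importance Sampling" entry above, the sketch below shows the basic importance-weighted REINFORCE estimator that such methods build on, for a one-dimensional Gaussian policy. It only illustrates the passive re-weighting of samples collected under a behavioral policy; the active choice of that behavioral policy, which is that paper's contribution, is not reproduced, and all names and constants are illustrative.

```python
# Hedged sketch: importance-weighted policy gradient for a 1-D Gaussian policy,
# reusing actions sampled under a behavioral mean `behav_mean`. Illustrative only.
import numpy as np

def gaussian_logpdf(a, mean, std):
    return -0.5 * np.log(2.0 * np.pi * std**2) - (a - mean) ** 2 / (2.0 * std**2)

def is_policy_gradient(actions, returns, target_mean, behav_mean, std=1.0):
    """Estimate dJ/d(target_mean) from samples drawn under behav_mean."""
    # Per-sample importance weight pi_theta(a) / beta(a).
    w = np.exp(gaussian_logpdf(actions, target_mean, std)
               - gaussian_logpdf(actions, behav_mean, std))
    # Score function of the Gaussian mean: d log pi / d mean = (a - mean) / std^2.
    score = (actions - target_mean) / std**2
    grad_samples = w * score * returns
    return grad_samples.mean(), grad_samples.var()

rng = np.random.default_rng(0)
behav_mean, target_mean = 0.5, 0.0
actions = rng.normal(behav_mean, 1.0, size=10_000)
returns = -(actions - 1.0) ** 2  # toy one-step reward
grad, var = is_policy_gradient(actions, returns, target_mean, behav_mean)
print(f"IS gradient estimate: {grad:.3f} (per-sample variance: {var:.1f})")
```

The variance of `grad_samples` is exactly the quantity that choosing a better behavioral policy aims to reduce.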