Proximal Deterministic Policy Gradient
- URL: http://arxiv.org/abs/2008.00759v1
- Date: Mon, 3 Aug 2020 10:19:59 GMT
- Title: Proximal Deterministic Policy Gradient
- Authors: Marco Maggipinto and Gian Antonio Susto and Pratik Chaudhari
- Abstract summary: We introduce two techniques to improve off-policy Reinforcement Learning (RL) algorithms.
We exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate.
We demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.
- Score: 20.951797549505986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces two simple techniques to improve off-policy
Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a
stochastic proximal point iteration. The target network plays the role of the
variable of optimization and the value network computes the proximal operator.
Second, we exploit the two value functions commonly employed in
state-of-the-art off-policy algorithms to provide an improved action value
estimate through bootstrapping, with a limited increase in computational
resources. Further, we demonstrate significant performance improvement over
state-of-the-art algorithms on standard continuous-control RL benchmarks.
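To make the two ingredients above concrete, here is a minimal PyTorch sketch of how they could sit on top of a TD3/SAC-style twin-critic update. It is an illustration under stated assumptions, not the authors' reference implementation: the network sizes, the `prox_weight` coefficient, and the clipped-double-Q choice for the bootstrapped target are assumptions introduced for the example.
```python
# Minimal sketch (not the paper's reference code) of a proximal-point-flavoured
# critic update with two value functions, in a TD3/SAC-style setup.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_q(obs_dim, act_dim):
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 1),
    )

obs_dim, act_dim, gamma, tau, prox_weight = 8, 2, 0.99, 0.005, 1.0
q1, q2 = make_q(obs_dim, act_dim), make_q(obs_dim, act_dim)
q1_targ, q2_targ = copy.deepcopy(q1), copy.deepcopy(q2)
optim = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)

def critic_update(obs, act, rew, next_obs, next_act):
    # Batched 2D tensors expected; rewards shaped (batch, 1).
    # Termination handling is omitted for brevity.
    with torch.no_grad():
        # Improved action-value estimate from the two target critics; the usual
        # clipped-double-Q minimum is one simple choice for this sketch.
        q_next = torch.min(
            q1_targ(torch.cat([next_obs, next_act], -1)),
            q2_targ(torch.cat([next_obs, next_act], -1)),
        )
        td_target = rew + gamma * q_next

    q1_pred = q1(torch.cat([obs, act], -1))
    q2_pred = q2(torch.cat([obs, act], -1))

    # Proximal-point flavour: besides fitting the bootstrapped target, keep the
    # online critics close to the target networks, so each step approximately
    # evaluates a proximal operator anchored at the target parameters.
    td_loss = F.mse_loss(q1_pred, td_target) + F.mse_loss(q2_pred, td_target)
    with torch.no_grad():
        q1_anchor = q1_targ(torch.cat([obs, act], -1))
        q2_anchor = q2_targ(torch.cat([obs, act], -1))
    prox_loss = F.mse_loss(q1_pred, q1_anchor) + F.mse_loss(q2_pred, q2_anchor)

    optim.zero_grad()
    (td_loss + prox_weight * prox_loss).backward()
    optim.step()

    # Slowly move the proximal anchors (target networks) toward the online critics.
    with torch.no_grad():
        for p, p_targ in zip(q1.parameters(), q1_targ.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
        for p, p_targ in zip(q2.parameters(), q2_targ.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
```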
Related papers
- Fast Two-Time-Scale Stochastic Gradient Method with Applications in Reinforcement Learning [5.325297567945828]
We propose a new method for two-time-scale optimization that achieves significantly faster convergence than prior work.
We characterize the proposed algorithm under various conditions and show how it specializes to online sample-based methods.
arXiv Detail & Related papers (2024-05-15T19:03:08Z)
- Decoupled Prioritized Resampling for Offline RL [120.49021589395005]
We propose Offline Prioritized Experience Replay (OPER) for offline reinforcement learning.
OPER features a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training.
We show that this class of priority functions induces an improved behavior policy, and that when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution.
arXiv Detail & Related papers (2023-06-08T17:56:46Z)
- Quantile-Based Deep Reinforcement Learning using Two-Timescale Policy Gradient Algorithms [0.0]
We parameterize the policy controlling actions by neural networks and propose a novel policy gradient algorithm called Quantile-Based Policy Optimization (QPO).
Our numerical results indicate that the proposed algorithms outperform the existing baseline algorithms under the quantile criterion.
arXiv Detail & Related papers (2023-05-12T04:47:02Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations."
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
- Mirror Descent Policy Optimization [41.46894905097985]
We propose an efficient RL algorithm called mirror descent policy optimization (MDPO).
MDPO iteratively updates the policy by approximately solving a trust-region problem.
We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms, TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact not a necessity for high performance gains in TRPO.
arXiv Detail & Related papers (2020-05-20T01:30:43Z)
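The MDPO entry above rests on approximately solving a KL-regularized trust-region problem at every iteration. As a minimal sketch of that update, the tabular case is convenient because mirror descent with a KL divergence has a closed form, pi_new(a|s) proportional to pi_old(a|s) * exp(eta * A(s, a)); the deep-RL algorithm in the paper instead takes a few gradient steps on a parameterized policy. The state/action counts, step size `eta`, and advantage values below are illustrative only.
```python
# Tabular illustration of one KL-mirror-descent policy update (an assumption
# for clarity, not the paper's deep-RL implementation).
import numpy as np

def mdpo_tabular_step(pi_old: np.ndarray, advantages: np.ndarray, eta: float) -> np.ndarray:
    """One KL-mirror-descent policy update.

    pi_old:      (n_states, n_actions) current policy, rows sum to 1.
    advantages:  (n_states, n_actions) advantage estimates under pi_old.
    eta:         step size; a larger eta means a looser implicit trust region.
    """
    logits = np.log(pi_old) + eta * advantages
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)  # renormalize per state

# Tiny usage example with 2 states and 3 actions.
pi = np.full((2, 3), 1.0 / 3.0)
adv = np.array([[0.5, -0.2, 0.0], [0.1, 0.1, -0.3]])
pi = mdpo_tabular_step(pi, adv, eta=1.0)
print(pi)  # rows still sum to 1; mass shifts toward higher-advantage actions
```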