Zeroth-order Deterministic Policy Gradient
- URL: http://arxiv.org/abs/2006.07314v2
- Date: Sat, 11 Jul 2020 18:16:22 GMT
- Title: Zeroth-order Deterministic Policy Gradient
- Authors: Harshat Kumar and Dionysios S. Kalogerias and George J. Pappas and
Alejandro Ribeiro
- Abstract summary: We introduce Zeroth-order Deterministic Policy Gradient (ZDPG).
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function.
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
- Score: 116.87117204825105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deterministic Policy Gradient (DPG) removes a level of randomness from
standard randomized-action Policy Gradient (PG), and demonstrates substantial
empirical success for tackling complex dynamic problems involving Markov
decision processes. At the same time, though, DPG loses its ability to learn in
a model-free (i.e., actor-only) fashion, frequently necessitating the use of
critics in order to obtain consistent estimates of the associated policy-reward
gradient. In this work, we introduce Zeroth-order Deterministic Policy Gradient
(ZDPG), which approximates policy-reward gradients via two-point stochastic
evaluations of the $Q$-function, constructed by properly designed
low-dimensional action-space perturbations. Exploiting the idea of random
horizon rollouts for obtaining unbiased estimates of the $Q$-function, ZDPG
lifts the dependence on critics and restores true model-free policy learning,
while enjoying built-in and provable algorithmic stability. Additionally, we
present new finite sample complexity bounds for ZDPG, which improve upon
existing results by up to two orders of magnitude. Our findings are supported
by several numerical experiments, which showcase the effectiveness of ZDPG in a
practical setting, and its advantages over both PG and Baseline PG.
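As a rough illustration of the mechanism described in the abstract, the Python sketch below combines the two ingredients: a random-horizon rollout that gives an unbiased estimate of the $Q$-function, and a symmetric two-point finite difference along a random action-space direction, chained through the policy Jacobian in the spirit of the deterministic policy gradient theorem. The helpers `env.reset_to`, `env.step_from`, `policy`, and `policy_jacobian` are hypothetical stand-ins, and the geometric-horizon construction is the standard unbiasedness trick rather than the paper's exact estimator.

```python
import numpy as np

def rollout_q_estimate(env, policy, theta, state, action, gamma=0.99, rng=None):
    """Unbiased estimate of Q(state, action): draw a horizon
    T ~ Geometric(1 - gamma) and sum the (undiscounted) rewards of a
    T-step rollout that starts with `action` and then follows the
    deterministic policy. `env.reset_to` and `env.step_from` are
    hypothetical helpers, not a real Gym API."""
    rng = rng if rng is not None else np.random.default_rng()
    horizon = rng.geometric(1.0 - gamma)      # random horizon, >= 1
    s, a, total = env.reset_to(state), action, 0.0
    for _ in range(horizon):
        s, r, done = env.step_from(s, a)      # one transition from (s, a)
        total += r
        if done:
            break
        a = policy(theta, s)                  # deterministic action at the next state
    return total


def zdpg_style_gradient(env, policy, policy_jacobian, theta, state,
                        mu=0.05, gamma=0.99, rng=None):
    """Two-point zeroth-order estimate of grad_a Q at a = policy(theta, state),
    mapped to a parameter-space direction via the policy Jacobian."""
    rng = rng if rng is not None else np.random.default_rng()
    a = policy(theta, state)
    u = rng.standard_normal(a.shape)
    u /= np.linalg.norm(u)                    # unit perturbation in the action space
    q_plus = rollout_q_estimate(env, policy, theta, state, a + mu * u, gamma, rng)
    q_minus = rollout_q_estimate(env, policy, theta, state, a - mu * u, gamma, rng)
    grad_a_q = (q_plus - q_minus) / (2.0 * mu) * u     # finite-difference surrogate for grad_a Q
    return policy_jacobian(theta, state).T @ grad_a_q  # shape: (dim_theta,)
```

A critic-free update would then take an ascent step along this direction, e.g. `theta = theta + eta * zdpg_style_gradient(...)`, averaged over sampled states.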
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Actor-Critic Reinforcement Learning with Phased Actor [10.577516871906816]
We propose a novel phased actor in actor-critic (PAAC) method to improve policy gradient estimation.
PAAC accounts for both $Q$ value and TD error in its actor update.
Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate.
arXiv Detail & Related papers (2024-04-18T01:27:31Z) - Global Convergence of Natural Policy Gradient with Hessian-aided
Momentum Variance Reduction [6.320200835271402]
Natural policy gradient (NPG) and its variants are widely-used policy search methods in reinforcement learning.
A new NPG variant, coined NPG-HM, is developed in this paper, which utilizes the Hessian-aided momentum technique for variance reduction.
Experiments on Mujoco-based environments demonstrate the superior performance of NPG-HM over other state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2024-01-02T07:56:17Z) - Optimal Estimation of Off-Policy Policy Gradient via Double Fitted
Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient
Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
arXiv Detail & Related papers (2020-07-16T16:57:41Z) - Momentum-Based Policy Gradient Methods [133.53164856723782]
We propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning.
In particular, we present a non-adaptive version of the IS-MBPG method, which also reaches the best-known sample complexity of $O(\epsilon^{-3})$ without any large batches.
arXiv Detail & Related papers (2020-07-13T20:44:15Z) - Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z) - Stochastic Recursive Momentum for Policy Gradient Methods [28.277961340108313]
We propose a novel algorithm named STOchastic Recursive Momentum for Policy Gradient (Storm-PG)
Storm-PG enjoys a provably sharp $O(1/\epsilon^3)$ sample bound, matching the best-known convergence rate for policy gradient algorithms.
Numerical experiments demonstrate the superiority of our algorithm over comparable policy gradient algorithms (a generic sketch of the recursive momentum update is given after this list).
arXiv Detail & Related papers (2020-03-09T17:59:03Z)
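For the two momentum-based entries above (IS-MBPG and Storm-PG), the common ingredient is a STORM-style recursive momentum estimator. The sketch below shows only the generic update under that assumption; the actual policy gradient variants additionally importance-weight the gradient evaluated at the previous parameters, which is omitted here, and all names are illustrative.

```python
def recursive_momentum_step(grad_fn, theta, theta_prev, d_prev, batch,
                            step_size=0.01, momentum=0.1):
    """One STORM-style recursive momentum update (generic sketch):

        d_t = g(theta_t; batch) + (1 - momentum) * (d_{t-1} - g(theta_{t-1}; batch))

    where g is a stochastic (policy) gradient estimator evaluated on the
    same batch of trajectories at both parameter values."""
    g_curr = grad_fn(theta, batch)        # gradient estimate at current parameters
    g_prev = grad_fn(theta_prev, batch)   # same samples, previous parameters
    d = g_curr + (1.0 - momentum) * (d_prev - g_prev)
    theta_next = theta + step_size * d    # ascent step on the expected return
    return theta_next, d
```

Roughly speaking, this variance-reduced estimator is what those papers credit for $O(1/\epsilon^3)$-type sample bounds without restarts or large batches.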
This list is automatically generated from the titles and abstracts of the papers in this site.