On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces
- URL: http://arxiv.org/abs/2201.12332v2
- Date: Mon, 31 Jan 2022 03:40:59 GMT
- Title: On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces
- Authors: Amrit Singh Bedi, Souradip Chakraborty, Anjaly Parayil, Brian Sadler,
Pratap Tokekar, Alec Koppel
- Abstract summary: We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes.
- Score: 23.186300629667134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We focus on parameterized policy search for reinforcement learning over
continuous action spaces. Typically, one assumes the score function associated
with a policy is bounded, which fails to hold even for Gaussian policies. To
properly address this issue, one must introduce an exploration tolerance
parameter to quantify the region in which it is bounded. Doing so incurs a
persistent bias that appears in the attenuation rate of the expected policy
gradient norm, which is inversely proportional to the radius of the action
space. To mitigate this hidden bias, heavy-tailed policy parameterizations may
be used, which exhibit a bounded score function, but doing so can cause
instability in algorithmic updates. To address these issues, in this work, we
study the convergence of policy gradient algorithms under heavy-tailed
parameterizations, which we propose to stabilize with a combination of mirror
ascent-type updates and gradient tracking. Our main theoretical contribution is
the establishment that this scheme converges with constant step and batch
sizes, whereas prior works require these parameters to respectively shrink to
null or grow to infinity. Experimentally, this scheme under a heavy-tailed
policy parameterization yields improved reward accumulation across a variety of
settings as compared with standard benchmarks.
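To make the boundedness issue concrete, here is a minimal sketch (an illustration of the underlying point, not code from the paper) comparing the score function with respect to the location parameter of a one-dimensional Gaussian policy, which grows linearly in the action and is therefore unbounded, against that of a Cauchy (heavy-tailed) policy, which is bounded by 1/gamma:

```python
import numpy as np

def gaussian_score(a, mu, sigma):
    # d/dmu log N(a | mu, sigma^2) = (a - mu) / sigma^2: unbounded as |a - mu| grows
    return (a - mu) / sigma**2

def cauchy_score(a, mu, gamma):
    # d/dmu log Cauchy(a | mu, gamma) = 2(a - mu) / (gamma^2 + (a - mu)^2): bounded by 1/gamma
    return 2.0 * (a - mu) / (gamma**2 + (a - mu)**2)

actions = np.linspace(-100.0, 100.0, 2001)
print("max |Gaussian score|:", np.abs(gaussian_score(actions, 0.0, 1.0)).max())  # grows with the action range
print("max |Cauchy score|  :", np.abs(cauchy_score(actions, 0.0, 1.0)).max())    # stays at 1/gamma = 1
```

The bounded score is what lets the heavy-tailed parameterization avoid the exploration-tolerance bias described above, at the cost of the update instability that the mirror-ascent and gradient-tracking machinery is introduced to control.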
Related papers
- Soft Actor-Critic with Beta Policy via Implicit Reparameterization Gradients [0.0]
Soft actor-critic (SAC) mitigates poor sample efficiency by combining policy optimization and off-policy learning.
It is limited to distributions whose gradients can be computed through the reparameterization trick.
We extend this technique to train SAC with the beta policy on simulated robot locomotion environments.
Experimental results show that the beta policy is a viable alternative, as it outperforms the normal policy and is on par with the squashed normal policy.
arXiv Detail & Related papers (2024-09-08T04:30:51Z)
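As a hedged sketch of the mechanism this paper relies on (not its implementation): PyTorch's Beta distribution supports reparameterized sampling via implicit reparameterization gradients, so pathwise gradients can flow from a surrogate objective back to the concentration parameters of a bounded policy. The surrogate below is a hypothetical stand-in for a critic value.

```python
import torch
from torch.distributions import Beta

# Hypothetical policy head: log-concentration parameters for a single action dimension.
log_alpha = torch.zeros(1, requires_grad=True)
log_beta = torch.zeros(1, requires_grad=True)

dist = Beta(log_alpha.exp(), log_beta.exp())
action = dist.rsample()              # implicit reparameterization: the sample is differentiable
surrogate = -(action - 0.7).pow(2)   # stand-in for a critic's value of the sampled action
surrogate.sum().backward()           # gradients reach log_alpha and log_beta

print(action.item(), log_alpha.grad.item(), log_beta.grad.item())
```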
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic last-iterate convergence result for single-time-scale policy-based algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
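A minimal sketch of the single-time-scale primal-dual idea discussed in this entry, under assumed estimators for the reward and cost gradients (the function and its arguments are illustrative, not the authors' algorithm): the policy ascends the Lagrangian while the multiplier moves on the constraint violation, both with the same step size.

```python
import numpy as np

def primal_dual_step(theta, lam, grad_reward, grad_cost, cost_value, budget, eta=0.01):
    """One single-time-scale update on the Lagrangian L = V_r(theta) - lam * (V_c(theta) - budget).

    grad_reward, grad_cost: estimated policy gradients of the reward and cost values;
    cost_value: estimated cost of the current policy; budget: constraint threshold.
    """
    theta = theta + eta * (grad_reward - lam * grad_cost)  # primal: ascend the Lagrangian in theta
    lam = max(0.0, lam + eta * (cost_value - budget))      # dual: increase lam when the constraint is violated
    return theta, lam
```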
- Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning [9.779769486156631]
Here the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy.
This exploration-exploitation trade-off is determined by the strength of entropy regularisation.
We prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
arXiv Detail & Related papers (2022-08-08T23:36:40Z)
- On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control [47.71156648737803]
Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model.
We characterize the induced Markov chain, identifying that policies associated with Lévy processes of a heavier tail index yield wider peaks.
arXiv Detail & Related papers (2021-06-15T20:12:44Z)
- Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased estimates of the gradients of the value function, which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
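As a hedged illustration of using unbiased score-function gradient estimates as ascent directions (a generic REINFORCE-style estimator over a finite rollout, not the paper's continuing-task construction; score_fn and the trajectory format are assumptions):

```python
import numpy as np

def policy_gradient_estimate(score_fn, trajectory, gamma=0.99):
    """Score-function estimate of the policy gradient from a single rollout.

    trajectory: list of (state, action, reward) tuples;
    score_fn(state, action): grad_theta log pi_theta(action | state) as a numpy array.
    """
    returns, G = [], 0.0
    for (_, _, r) in reversed(trajectory):      # discounted return-to-go, computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    grad = sum(score_fn(s, a) * G_t for (s, a, _), G_t in zip(trajectory, returns))
    return grad  # theta <- theta + step * grad is an ascent direction in expectation
```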
- Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning [14.325835899564664]
An entropy-regularized value-based reinforcement learning method can ensure monotonic improvement of policies at each policy update, with a lower bound on the amount of improvement.
We propose a novel reinforcement learning algorithm that exploits this lower bound as a criterion for adjusting the degree of a policy update, alleviating policy oscillation.
arXiv Detail & Related papers (2020-08-25T04:09:18Z)
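A hedged sketch of the entropy-regularized (soft) value-based backup this line of work builds on (a generic tabular softmax Bellman update and Boltzmann policy, not the paper's lower-bound criterion):

```python
import numpy as np

def soft_backup(Q, P, R, gamma=0.99, tau=0.1):
    """One entropy-regularized Bellman backup on a tabular Q-table.

    Q: (S, A) action values; P: (S, A, S) transition probabilities; R: (S, A) rewards;
    tau: entropy temperature.
    """
    V_soft = tau * np.log(np.exp(Q / tau).sum(axis=1))   # soft state value: tau * logsumexp(Q / tau)
    Q_new = R + gamma * P @ V_soft                        # backed-up action values
    logits = (Q_new - Q_new.max(axis=1, keepdims=True)) / tau
    policy = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # Boltzmann (max-entropy) policy
    return Q_new, policy
```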
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
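As a hedged sketch of the Kalman view of value tracking (a generic Kalman-style update for a linear value model, not the KOVA algorithm; the noise parameters are illustrative):

```python
import numpy as np

def kalman_value_update(w, P, phi, target, obs_noise=1.0, process_noise=1e-3):
    """Treat value-function weights as a latent state and a sampled return as a noisy observation.

    w: weights of a linear value model V(s) = phi(s) @ w;
    P: weight covariance; phi: feature vector of the visited state; target: observed return.
    """
    P = P + process_noise * np.eye(len(w))   # predict: allow the weights to drift between updates
    innovation = target - phi @ w            # observation residual
    S = phi @ P @ phi + obs_noise            # innovation variance (scalar)
    K = P @ phi / S                          # Kalman gain
    w = w + K * innovation                   # correct the weights toward the observed return
    P = P - np.outer(K, phi @ P)             # shrink the weight covariance
    return w, P
```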
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.