On the Sample Complexity and Metastability of Heavy-tailed Policy Search
in Continuous Control
- URL: http://arxiv.org/abs/2106.08414v1
- Date: Tue, 15 Jun 2021 20:12:44 GMT
- Title: On the Sample Complexity and Metastability of Heavy-tailed Policy Search
in Continuous Control
- Authors: Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec
Koppel
- Abstract summary: Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model.
We characterize the set of local maxima through an exit- and transition-time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy processes of heavier tail index converge to wider peaks.
- Score: 47.71156648737803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning is a framework for interactive decision-making with
incentives sequentially revealed across time without a system dynamics model.
Due to its scaling to continuous spaces, we focus on policy search where one
iteratively improves a parameterized policy with stochastic policy gradient
(PG) updates. In tabular Markov Decision Problems (MDPs), under persistent
exploration and suitable parameterization, global optimality may be obtained.
By contrast, in continuous space, the non-convexity poses a pathological
challenge as evidenced by existing convergence results being mostly limited to
stationarity or arbitrary local extrema. To close this gap, we step towards
persistent exploration in continuous space through policy parameterizations
defined by heavier-tailed distributions with tail-index parameter alpha, which
increases the likelihood of jumping in state space. Doing so
invalidates smoothness conditions of the score function common to PG. Thus, we
establish how the convergence rate to stationarity depends on the policy's tail
index alpha, a Hölder continuity parameter, integrability conditions, and an
exploration tolerance parameter introduced here for the first time. Further, we
characterize the dependence of the set of local maxima on the tail index
through an exit and transition time analysis of a suitably defined Markov
chain, identifying that policies associated with Lévy processes of a heavier
tail converge to wider peaks. This phenomenon yields improved stability to
perturbations in supervised learning, which we corroborate also manifests in
improved performance of policy search, especially when myopic and farsighted
incentives are misaligned.
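To make the heavy-tailed policy-search idea concrete, below is a minimal REINFORCE-style sketch, not the authors' algorithm or code. It swaps the usual Gaussian policy for a Student-t policy whose degrees-of-freedom parameter df loosely stands in for the tail-index parameter alpha (smaller df means heavier tails and more frequent large exploratory jumps). The one-step quadratic-reward task, the linear parameterization, and all hyperparameters are illustrative assumptions.

```python
# Minimal REINFORCE-style sketch: heavy-tailed (Student-t) vs. Gaussian policy
# on a toy one-step continuous-control task. The task, the linear
# parameterization, the use of the Student-t degrees of freedom `df` as a
# stand-in for the tail index alpha, and all hyperparameters are assumptions,
# not the paper's experimental setup.
import torch
from torch.distributions import Normal, StudentT


def run_policy_search(heavy_tailed: bool, df: float = 1.5, steps: int = 2000,
                      batch: int = 64, lr: float = 1e-2, seed: int = 0) -> float:
    torch.manual_seed(seed)
    # Linear-in-state policy location: loc = w * s + b.
    w = torch.zeros(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    log_scale = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b, log_scale], lr=lr)

    for _ in range(steps):
        s = torch.randn(batch, 1)                      # sampled states
        loc = w * s + b
        scale = torch.nn.functional.softplus(log_scale) + 1e-3
        # Heavier tails (small df) make large exploratory jumps more likely.
        pi = StudentT(df, loc=loc, scale=scale) if heavy_tailed else Normal(loc, scale)
        a = pi.sample()
        # Toy reward: quadratic penalty around a state-dependent target 2*s + 1.
        reward = -(a - (2.0 * s + 1.0)) ** 2
        # Batch-normalized advantage keeps the score-function update stable.
        adv = (reward - reward.mean()) / (reward.std() + 1e-6)
        loss = -(pi.log_prob(a) * adv).mean()          # policy-gradient surrogate
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Evaluate the learned mean behaviour (deterministic rollout of the policy).
    with torch.no_grad():
        s = torch.randn(4096, 1)
        return float((-(w * s + b - (2.0 * s + 1.0)) ** 2).mean())


if __name__ == "__main__":
    print("Gaussian policy :", run_policy_search(heavy_tailed=False))
    print("Student-t policy:", run_policy_search(heavy_tailed=True))
```

The score-function update itself is unchanged; only the sampling distribution of the policy differs, which is the sense in which the tail index enters the analysis above.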
Related papers
- Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version (see the sketch after this list).
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - A Fisher-Rao gradient flow for entropy-regularised Markov decision
processes in Polish spaces [10.777806006475297]
We study the global convergence of a Fisher-Rao policy gradient flow for infinite-horizon entropy-regularised Markov decision processes with Polish state and action space.
We establish the global well-posedness of the gradient flow and demonstrate its exponential convergence to the optimal policy.
arXiv Detail & Related papers (2023-10-04T16:41:36Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Wasserstein Actor-Critic: Directed Exploration via Optimism for
Continuous-Actions Control [41.7453231409493]
Wasserstein Actor-Critic (WAC) is an actor-critic architecture inspired by Wasserstein Q-Learning (WQL).
WAC enforces exploration in a principled way by guiding the policy learning process with the optimization of an upper bound of the Q-value estimates.
arXiv Detail & Related papers (2023-03-04T10:52:20Z) - Linear convergence of a policy gradient method for finite horizon
continuous time stochastic control problems [3.7971225066055765]
This paper proposes a provably convergent gradient method for general continuous space-time control problems.
We show that the algorithm converges linearly to the optimal policy and is stable with respect to policy updates.
arXiv Detail & Related papers (2022-03-22T14:17:53Z) - On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z) - A Study of Policy Gradient on a Class of Exactly Solvable Models [35.90565839381652]
We explore the evolution of the policy parameters, for a special class of exactly solvable POMDPs, as a continuous-state Markov chain.
Our approach relies heavily on random walk theory, specifically on affine Weyl groups.
We analyze the probabilistic convergence of policy gradient to different local maxima of the value function.
arXiv Detail & Related papers (2020-11-03T17:27:53Z) - Fine-Grained Analysis of Stability and Generalization for Stochastic
Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first-ever-known stability and generalization bounds for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z)
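The "learn stochastic, deploy deterministic" practice summarized in the entry on learning optimal deterministic policies can be pictured with a small sketch. This is a generic illustration under assumed names (GaussianPolicy, act_deterministic), not code from that paper; tuning the exploration level there roughly corresponds to how log_std is chosen or scheduled here.

```python
# Hedged sketch (not from any of the listed papers): learn a stochastic
# Gaussian policy for exploration, then deploy its deterministic version by
# acting with the mean. Network shape and method names are assumptions.
import torch
from torch.distributions import Normal


class GaussianPolicy(torch.nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.mean_net = torch.nn.Sequential(
            torch.nn.Linear(state_dim, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, action_dim),
        )
        # State-independent log standard deviation controls the exploration level.
        self.log_std = torch.nn.Parameter(torch.zeros(action_dim))

    def act_stochastic(self, state: torch.Tensor) -> torch.Tensor:
        """Exploratory action used while learning the (hyper)policy."""
        mean = self.mean_net(state)
        return Normal(mean, self.log_std.exp()).sample()

    def act_deterministic(self, state: torch.Tensor) -> torch.Tensor:
        """Deployed action: the deterministic version (the mean)."""
        with torch.no_grad():
            return self.mean_net(state)


if __name__ == "__main__":
    policy = GaussianPolicy(state_dim=3, action_dim=1)
    s = torch.randn(3)
    print("train-time action :", policy.act_stochastic(s))
    print("deploy-time action:", policy.act_deterministic(s))
```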