Non-Parametric Stochastic Policy Gradient with Strategic Retreat for
Non-Stationary Environment
- URL: http://arxiv.org/abs/2203.14905v1
- Date: Thu, 24 Mar 2022 21:41:13 GMT
- Title: Non-Parametric Stochastic Policy Gradient with Strategic Retreat for
Non-Stationary Environment
- Authors: Apan Dastider and Mingjie Lin
- Abstract summary: We propose a systematic methodology to learn a sequence of optimal control policies non-parametrically.
Our methodology outperforms the well-established DDPG and TD3 methods by a sizeable margin in terms of learning performance.
- Score: 1.5229257192293197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In modern robotics, effectively computing optimal control policies under
dynamically varying environments poses substantial challenges to the
off-the-shelf parametric policy gradient methods, such as the Deep
Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic policy
gradient (TD3). In this paper, we propose a systematic methodology to
dynamically learn a sequence of optimal control policies non-parametrically,
while autonomously adapting with the constantly changing environment dynamics.
Specifically, our non-parametric kernel-based methodology embeds a policy
distribution as features in a non-decreasing Euclidean space, thereby
allowing its search space to be defined as a very high (possibly infinite)
dimensional RKHS (Reproducing Kernel Hilbert Space). Moreover, by leveraging
the similarity metric computed in the RKHS, we augment our non-parametric
learning with AdaptiveH, a technique that adaptively selects a time-frame
window for finishing the optimal part of the whole action sequence sampled on
some preceding observed state. To validate our proposed approach, we conducted
extensive experiments with multiple classic benchmarks and one simulated
robotics benchmark equipped with dynamically changing environments. Overall,
our methodology outperforms the well-established DDPG and TD3 methods
by a sizeable margin in terms of learning performance.
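To make the abstract's two ingredients more concrete, here is a minimal, assumption-laden sketch (not the authors' released code): an RBF kernel similarity, a standard stand-in for the similarity metric a kernel method computes in an RKHS, and a hypothetical adaptive_horizon routine in the spirit of AdaptiveH that decides how much of a previously sampled action sequence to keep executing once the environment has drifted. The function names, thresholds, and similarity-to-window mapping are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel, the canonical reproducing kernel of an
    infinite-dimensional RKHS; k(x, y) in (0, 1] measures similarity."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))

def adaptive_horizon(prev_state, new_state, action_sequence,
                     min_keep=1, similarity_floor=0.2, bandwidth=1.0):
    """Hypothetical AdaptiveH-style rule (an assumption, not the paper's
    exact algorithm): the more the environment has drifted away from the
    state the action sequence was planned on, the fewer of those actions
    we commit to before strategically retreating and replanning."""
    similarity = rbf_kernel(prev_state, new_state, bandwidth)
    horizon = len(action_sequence)
    # Map similarity in [floor, 1] onto a window size in [min_keep, horizon].
    frac = max(0.0, (similarity - similarity_floor) / (1.0 - similarity_floor))
    keep = max(min_keep, int(round(frac * horizon)))
    return action_sequence[:keep], similarity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    planned_on = rng.normal(size=4)                      # state the plan was sampled on
    drifted_to = planned_on + 0.8 * rng.normal(size=4)   # non-stationary drift
    actions = [rng.normal(size=2) for _ in range(10)]
    window, sim = adaptive_horizon(planned_on, drifted_to, actions)
    print(f"similarity={sim:.3f}, executing {len(window)} of {len(actions)} actions")
```

The design intuition captured here is that a low kernel similarity between the state the plan was made on and the current state signals environment drift, and therefore an earlier "strategic retreat" to replanning.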
Related papers
- Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
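As a point of orientation only, the snippet below is a toy, single-state illustration of the general primal-dual recipe this summary refers to (a deterministic policy updated by gradient ascent on a Lagrangian, a dual variable updated on the constraint violation and projected to stay nonnegative); it is not D-PGPD itself, and the reward, constraint, and step sizes are all assumptions.

```python
import numpy as np

def reward(a):            # toy objective with unconstrained peak at a = 2
    return -(a - 2.0) ** 2

def cost(a):              # constraint: cost(a) = a - 1 <= 0
    return a - 1.0

def reward_grad(a):
    return -2.0 * (a - 2.0)

def cost_grad(a):
    return 1.0

theta, lam = 0.0, 0.0                  # primal (policy) and dual variables
eta_primal, eta_dual = 0.05, 0.05

for t in range(2000):
    a = theta                          # deterministic policy: a = theta
    # Primal step: gradient ascent on the Lagrangian w.r.t. the policy.
    theta += eta_primal * (reward_grad(a) - lam * cost_grad(a))
    # Dual step: ascent on the constraint violation, projected onto lam >= 0.
    lam = max(0.0, lam + eta_dual * cost(theta))

print(f"action ~ {theta:.3f} (constrained optimum is a = 1), dual lambda = {lam:.3f}")
```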
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU problems are naturally modeled as Multistage Problems (MSPs), but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach, Two-Stage General Decision Rules (TS-GDR), to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-DDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
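A minimal sketch of the "learn a stochastic policy, deploy its deterministic version" recipe described above, under toy assumptions (a one-step quadratic objective, a Gaussian policy with a fixed exploration level sigma, and a baseline-subtracted REINFORCE gradient); the exploration level sigma is the knob the paper proposes to tune, and this is an illustration rather than the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def reward(a):                          # one-step objective with optimum at a = 3
    return -(a - 3.0) ** 2

def train(sigma, steps=2000, lr=0.02):
    """Learn the mean of a Gaussian policy a ~ N(mu, sigma^2) stochastically."""
    mu = 0.0
    for _ in range(steps):
        a = mu + sigma * rng.standard_normal()
        # Score-function (REINFORCE) estimate of d/dmu E[r(a)], with r(mu)
        # subtracted as a baseline to keep the variance manageable.
        grad = (reward(a) - reward(mu)) * (a - mu) / sigma ** 2
        mu += lr * grad
    return mu

for sigma in (0.1, 0.5, 1.0):           # exploration levels to compare
    mu = train(sigma)
    # Deployment uses only the deterministic mean of the learned policy.
    print(f"sigma={sigma:.1f}  deployed action={mu:.3f}  deployed reward={reward(mu):.4f}")
```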
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Stabilizing Policy Gradients for Stochastic Differential Equations via Consistency with Perturbation Process [11.01014302314467]
We focus on optimizing deep neural network-parameterized stochastic differential equations (SDEs).
We propose constraining the SDE to be consistent with its associated perturbation process.
Our framework offers a versatile selection of policy gradient methods to effectively and efficiently train SDEs.
arXiv Detail & Related papers (2024-03-07T02:24:45Z) - Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods [0.40964539027092917]
Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems.
In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming.
This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time.
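To illustrate the backward-in-time idea on a toy example: the sketch below (my assumptions, not the paper's algorithm) trains one tabular softmax policy per stage of a tiny finite-horizon MDP, starting from the last stage and moving backwards, so that each stage's REINFORCE update sees returns generated by already-trained later stages, mirroring dynamic programming's backward induction.

```python
import numpy as np

rng = np.random.default_rng(2)
H, n_states, n_actions = 3, 2, 2
R = np.array([[1.0, 0.0],                 # reward R[s, a]
              [0.0, 2.0]])
P = np.array([[1, 0],                     # deterministic next state P[s, a]
              [0, 1]])
theta = np.zeros((H, n_states, n_actions))  # one softmax policy per stage

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout_from(t, s):
    """Sampled return-to-go from state s at stage t under current policies."""
    g = 0.0
    for k in range(t, H):
        a = rng.choice(n_actions, p=softmax(theta[k, s]))
        g += R[s, a]
        s = P[s, a]
    return g

# Backward-in-time training: later stages are optimized first.
for t in reversed(range(H)):
    for _ in range(2000):
        s = rng.integers(n_states)                     # exploratory start state
        probs = softmax(theta[t, s])
        a = rng.choice(n_actions, p=probs)
        g = R[s, a] + (rollout_from(t + 1, P[s, a]) if t + 1 < H else 0.0)
        grad_log = -probs
        grad_log[a] += 1.0                             # d log pi(a|s) / d theta
        theta[t, s] += 0.1 * g * grad_log              # REINFORCE update

for t in range(H):
    print(f"stage {t}: greedy action per state =",
          [int(np.argmax(theta[t, s])) for s in range(n_states)])
```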
arXiv Detail & Related papers (2023-10-04T09:21:01Z) - Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z) - Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
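For intuition, here is a toy sketch of the gradient-based improper-learning setup described above (assumptions throughout, not the paper's algorithm): the learner cannot change the M fixed base controllers and only adjusts softmax mixture weights over them, updated with a REINFORCE-style policy gradient on a small setpoint-regulation task.

```python
import numpy as np

rng = np.random.default_rng(3)
gains = np.array([0.1, 0.6, 1.2])       # M = 3 fixed proportional base controllers
w = np.zeros(len(gains))                 # mixture logits: the only learned quantity
baseline, lr, T = 0.0, 0.05, 5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for episode in range(3000):
    x = rng.uniform(1.0, 2.0)            # start away from the setpoint x = 0
    grad_log, ret = np.zeros_like(w), 0.0
    for _ in range(T):
        p = softmax(w)
        i = rng.choice(len(gains), p=p)  # pick one base controller this step
        x = x - gains[i] * x             # apply its action a = -k_i * x
        ret += -x ** 2                   # reward: negative squared error
        g = -p.copy()
        g[i] += 1.0                      # d log p(i) / d w
        grad_log += g
    baseline = 0.99 * baseline + 0.01 * ret
    w += lr * (ret - baseline) * grad_log   # REINFORCE with a moving baseline

print("mixture weights over gains", gains, "->", np.round(softmax(w), 3))
```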
arXiv Detail & Related papers (2021-02-16T14:53:55Z) - Jointly Learning Environments and Control Policies with Projected
Stochastic Gradient Ascent [3.118384520557952]
We introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem.
In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation.
We show that DEPS performs at least as well as or better in all three environments, consistently yielding solutions with higher returns in fewer iterations.
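As an illustrative toy (assumptions only, not the DEPS implementation), the sketch below jointly improves an environment design parameter and a policy parameter by projected stochastic gradient ascent on a Monte-Carlo estimate of the return; the real method obtains such gradients with automatic differentiation through a simulator, whereas this toy simulator is simple enough to differentiate by hand.

```python
import numpy as np

rng = np.random.default_rng(4)
phi, theta = 1.8, 0.0          # environment design parameter and policy parameter
sigma, lr = 0.1, 0.05
PHI_BOX = (0.5, 2.0)           # feasible set for the environment design

for step in range(4000):
    eps = rng.standard_normal()
    a = theta + sigma * eps                    # stochastic policy action
    err = phi * a - 1.0                        # tracking error in the toy "simulator"
    # Reparameterized Monte-Carlo gradients of the per-step return
    # r = -(phi * a - 1)^2 - 0.1 * phi^2  (the last term is a design cost).
    grad_theta = -2.0 * err * phi
    grad_phi = -2.0 * err * a - 0.2 * phi
    theta += lr * grad_theta
    phi = float(np.clip(phi + lr * grad_phi, *PHI_BOX))   # projection step

print(f"environment phi = {phi:.3f}, policy theta = {theta:.3f}, "
      f"product phi*theta = {phi * theta:.3f} (target 1.0)")
```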
arXiv Detail & Related papers (2020-06-02T16:08:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.