A Policy Optimization Method Towards Optimal-time Stability
- URL: http://arxiv.org/abs/2301.00521v2
- Date: Fri, 13 Oct 2023 01:47:48 GMT
- Title: A Policy Optimization Method Towards Optimal-time Stability
- Authors: Shengjie Wang, Fengbo Lan, Xiang Zheng, Yuxue Cao, Oluwatosin Oseni,
Haotian Xu, Tao Zhang, Yang Gao
- Abstract summary: We propose a policy optimization technique incorporating sampling-based Lyapunov stability.
Our approach enables the system's state to reach an equilibrium point within an optimal time.
- Score: 15.722871779526526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In current model-free reinforcement learning (RL) algorithms, stability
criteria based on sampling methods are commonly utilized to guide policy
optimization. However, these criteria only guarantee the infinite-time
convergence of the system's state to an equilibrium point, which leads to
sub-optimality of the policy. In this paper, we propose a policy optimization
technique incorporating sampling-based Lyapunov stability. Our approach enables
the system's state to reach an equilibrium point within an optimal time and
maintain stability thereafter, referred to as "optimal-time stability". To
achieve this, we integrate the optimization method into the Actor-Critic
framework, resulting in the development of the Adaptive Lyapunov-based
Actor-Critic (ALAC) algorithm. Through evaluations conducted on ten robotic
tasks, our approach outperforms previous studies significantly, effectively
guiding the system to generate stable patterns.
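The abstract describes folding a sampling-based Lyapunov stability condition into the Actor-Critic update. The snippet below is a minimal sketch of that general idea only (a Lyapunov decrease condition estimated from sampled transitions and enforced on the actor through a Lagrange multiplier), not the authors' ALAC implementation; the network architectures, names, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical networks: a deterministic policy head and a Lyapunov critic
# over (state, action) pairs. State dim 4 and action dim 2 are arbitrary.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
lyapunov = nn.Sequential(nn.Linear(6, 64), nn.Tanh(), nn.Linear(64, 1))
log_lambda = torch.zeros(1, requires_grad=True)  # Lagrange multiplier, kept positive via exp()

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
lam_opt = torch.optim.Adam([log_lambda], lr=3e-4)

def actor_step(s, s_next, alpha=0.1):
    """One actor/multiplier update on a sampled batch of transitions (s, s_next).
    Training of the Lyapunov critic itself is omitted in this sketch."""
    a, a_next = policy(s), policy(s_next)
    # Squaring keeps the Lyapunov candidate non-negative.
    L_s = lyapunov(torch.cat([s, a], dim=-1)).pow(2).mean()
    L_next = lyapunov(torch.cat([s_next, a_next], dim=-1)).pow(2).mean()

    # Sampling-based decrease condition: require L(s') <= (1 - alpha) * L(s)
    # on average over the batch; violation > 0 means the condition fails.
    violation = L_next - (1.0 - alpha) * L_s
    lam = log_lambda.exp().detach()

    # Stand-in task term (e.g. -Q(s, pi(s)) in a real agent) plus the Lyapunov penalty.
    actor_loss = a.pow(2).mean() + lam * violation
    pi_opt.zero_grad()
    actor_loss.backward()
    pi_opt.step()

    # Dual ascent: increase the multiplier while the constraint is violated.
    lam_loss = -(log_lambda.exp() * violation.detach())
    lam_opt.zero_grad()
    lam_loss.backward()
    lam_opt.step()

# Dummy batch of 32 transitions.
s, s_next = torch.randn(32, 4), torch.randn(32, 4)
actor_step(s, s_next)
```

Dual ascent on the multiplier is one common way to adapt the weight on the stability constraint during training; the paper's own adaptation scheme may differ.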
Related papers
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z) - KCRL: Krasovskii-Constrained Reinforcement Learning with Guaranteed
Stability in Nonlinear Dynamical Systems [66.9461097311667]
We propose a model-based reinforcement learning framework with formal stability guarantees.
The proposed method learns the system dynamics up to a confidence interval using a feature representation.
We show that KCRL is guaranteed to learn a stabilizing policy in a finite number of interactions with the underlying unknown system.
arXiv Detail & Related papers (2022-06-03T17:27:04Z) - Generalized Proximal Policy Optimization with Sample Reuse [8.325359814939517]
We combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms.
We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization (sketched in code after this list).
This motivates an off-policy version of the popular algorithm that we call Generalized Proximal Policy Optimization with Sample Reuse.
arXiv Detail & Related papers (2021-10-29T20:22:31Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
First, we show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Reinforcement Learning for Adaptive Optimal Stationary Control of Linear
Stochastic Systems [15.410124023805249]
This paper studies the adaptive optimal stationary control of continuous-time linear systems with both additive and multiplicative noises.
A novel off-policy reinforcement learning algorithm, named optimistic least-squares-based policy iteration, is proposed.
arXiv Detail & Related papers (2021-07-16T09:27:02Z) - On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
Relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses generative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
arXiv Detail & Related papers (2021-03-17T16:22:59Z) - A Dynamical Systems Approach for Convergence of the Bayesian EM
Algorithm [59.99439951055238]
We show how (discrete-time) Lyapunov stability theory can serve as a powerful tool to aid, or even lead, in the analysis (and potential design) of optimization algorithms that are not necessarily gradient-based.
The particular ML problem that this paper focuses on is that of parameter estimation in an incomplete-data Bayesian framework via the popular optimization algorithm known as maximum a posteriori expectation-maximization (MAP-EM).
We show that fast convergence (linear or quadratic) is achieved, which could have been difficult to unveil without our adopted S&C approach.
arXiv Detail & Related papers (2020-06-23T01:34:18Z) - Optimistic Distributionally Robust Policy Optimization [2.345728642535161]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are prone to converge to a sub-optimal solution as they limit policy representation to a particular parametric distribution class.
We develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRO) algorithm to solve the trust region constrained optimization problem without parameterizing the policies.
Our algorithm improves on TRPO and PPO with higher sample efficiency and better final-policy performance while maintaining learning stability.
arXiv Detail & Related papers (2020-06-14T06:36:18Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z) - Convergence Guarantees of Policy Optimization Methods for Markovian Jump
Linear Systems [3.3343656101775365]
We show that the Gauss-Newton method converges to the optimal state feedback controller for MJLS at a linear rate if initialized at a controller that stabilizes the closed-loop dynamics in the mean square sense.
We present an example to support our theory.
arXiv Detail & Related papers (2020-02-10T21:13:42Z)
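For reference, the clipping mechanism mentioned in the Generalized Proximal Policy Optimization with Sample Reuse entry above is the standard PPO clipped surrogate objective. The sketch below shows that standard objective with illustrative tensor names and a dummy batch; it is not the generalized off-policy variant proposed in that paper.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate loss (negated objective, to be minimized)."""
    ratio = (log_prob_new - log_prob_old).exp()  # r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t)
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - eps, 1.0 + eps) * advantages
    # The element-wise minimum keeps the update conservative whenever the
    # probability ratio leaves the [1 - eps, 1 + eps] interval.
    return -torch.min(unclipped, clipped).mean()

# Dummy batch of 64 transitions.
lp_new = torch.randn(64, requires_grad=True)
lp_old, adv = torch.randn(64), torch.randn(64)
ppo_clip_loss(lp_new, lp_old, adv).backward()
```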
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences of its use.