Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators
- URL: http://arxiv.org/abs/2303.08431v4
- Date: Sat, 10 Aug 2024 23:14:00 GMT
- Title: Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators
- Authors: Yinbin Han, Meisam Razaviyayn, Renyuan Xu,
- Abstract summary: We study the optimal rate in nearly linear-quadratic regulator systems.
We propose a policy that is guaranteed to the globally optimal rate with a gradient algorithm.
- Score: 11.400431211239958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nonlinear control systems with partial information to the decision maker are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in the nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components, and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish the local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on the developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy with a linear rate.
Related papers
- Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning [53.97335841137496]
We propose an oracle-efficient algorithm, dubbed Pessimistic Least-Square Value Iteration (PNLSVI) for offline RL with non-linear function approximation.
Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation.
arXiv Detail & Related papers (2023-10-02T17:42:01Z) - Best of Both Worlds in Online Control: Competitive Ratio and Policy
Regret [61.59646565655169]
We show that several recently proposed online control algorithms achieve the best of both worlds: sublinear regret vs. the best DAC policy selected in hindsight.
We conclude that sublinear regret vs. the optimal competitive policy is attainable when the linear dynamical system is unknown.
arXiv Detail & Related papers (2022-11-21T07:29:08Z) - Neural System Level Synthesis: Learning over All Stabilizing Policies
for Nonlinear Systems [0.0]
We propose a Neural SLS (Neur-SLS) approach guaranteeing closed-loop stability during and after parameter optimization.
We exploit recent Deep Neural Network (DNN) models based on Recurrent Equilibrium Networks (RENs) to learn over a rich class of nonlinear stable operators.
arXiv Detail & Related papers (2022-03-22T15:22:31Z) - Learning over All Stabilizing Nonlinear Controllers for a
Partially-Observed Linear System [4.3012765978447565]
We propose a parameterization of nonlinear output feedback controllers for linear dynamical systems.
Our approach guarantees the closed-loop stability of partially observable linear dynamical systems without requiring any constraints to be satisfied.
arXiv Detail & Related papers (2021-12-08T10:43:47Z) - Youla-REN: Learning Nonlinear Feedback Policies with Robust Stability
Guarantees [5.71097144710995]
This paper presents a parameterization of nonlinear controllers for uncertain systems building on a recently developed neural network architecture.
The proposed framework has "built-in" guarantees of stability, i.e., all policies in the search space result in a contracting (globally exponentially stable) closed-loop system.
arXiv Detail & Related papers (2021-12-02T13:52:37Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - On the Stability of Nonlinear Receding Horizon Control: A Geometric
Perspective [72.7951562665449]
widespread adoption of nonlinear Receding Control (RHC) strategies by industry has to more than 30 years.
This paper takes the first step towards understanding the role of global geometry in the role of global-based control.
arXiv Detail & Related papers (2021-03-27T22:59:37Z) - Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(poly(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms priors inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z) - Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
arXiv Detail & Related papers (2021-02-16T14:53:55Z) - Policy Gradient Methods for the Noisy Linear Quadratic Regulator over a
Finite Horizon [3.867363075280544]
We explore reinforcement learning methods for finding the optimal policy in the linear quadratic regulator (LQR) problem.
We produce a global linear convergence guarantee for the setting of finite time horizon and state dynamics under weak assumptions.
We show results for the case where we assume a model for the underlying dynamics and where we apply the method to the data directly.
arXiv Detail & Related papers (2020-11-20T09:51:49Z) - The Power of Linear Controllers in LQR Control [39.76359052907755]
We compute the policy regret between three distinct control policies.
We show that cost of the optimal offline linear policy converges to the cost of the optimal online policy.
Although we focus on the setting where the noise is, our results imply new lower bounds on the policy regret achievable when the noise is chosen by an adaptive adversary.
arXiv Detail & Related papers (2020-02-07T00:58:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.