Global Convergence of Policy Gradient for Entropy Regularized Linear-Quadratic Control with multiplicative noise
- URL: http://arxiv.org/abs/2510.02896v1
- Date: Fri, 03 Oct 2025 11:03:12 GMT
- Title: Global Convergence of Policy Gradient for Entropy Regularized Linear-Quadratic Control with multiplicative noise
- Authors: Gabriel Diaz, Lucky Li, Wenhao Zhang
- Abstract summary: Reinforcement Learning (RL) has emerged as a powerful framework for sequential decision-making in dynamic environments. This paper investigates RL-based control for entropy-regularized Linear Quadratic control (LQC) problems. We introduce a novel model-free RL algorithm: Sample-Based Regularized Policy Gradient (SB-RPG).
- Score: 7.339958589013675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning (RL) has emerged as a powerful framework for sequential decision-making in dynamic environments, particularly when system parameters are unknown. This paper investigates RL-based control for entropy-regularized Linear Quadratic control (LQC) problems with multiplicative noise over an infinite time horizon. First, we adapt the Regularized Policy Gradient (RPG) algorithm to stochastic optimal control settings, proving that despite the non-convexity of the problem, RPG converges globally under conditions of gradient domination and near-smoothness. Second, based on a zero-order optimization approach, we introduce a novel model-free RL algorithm: Sample-Based Regularized Policy Gradient (SB-RPG). SB-RPG operates without knowledge of system parameters yet still retains strong theoretical guarantees of global convergence. Our model leverages entropy regularization to accelerate convergence and address the exploration versus exploitation trade-off inherent in RL. Numerical simulations validate the theoretical results and demonstrate the efficacy of SB-RPG in environments with unknown parameters.
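To make the zero-order idea in the abstract concrete, here is a minimal Python sketch of a generic two-point gradient estimate applied to a scalar linear-quadratic system with multiplicative noise. It is not the paper's SB-RPG algorithm. The system matrices, noise scales, temperature `tau`, discount `gamma`, step size, and the helper names `rollout_cost`, `zero_order_gradient`, `average_cost`, and `run_demo` are all hypothetical choices made for illustration. The sketch also assumes a Gaussian policy with a fixed exploration scale, so the entropy term is constant in the gain `K` and only shifts the cost.

```python
import numpy as np


def rollout_cost(K, rng, A, B, Q, R, sigma_a, sigma_b, sigma_u, tau, gamma, horizon, x0):
    """Truncated, discounted cost of one rollout under u ~ N(-K x, sigma_u^2 I).

    Assumed convention: regularized cost = quadratic cost - tau * policy entropy.
    """
    n, m = B.shape
    x = x0.copy()
    # Differential entropy of the Gaussian policy (constant in K since sigma_u is fixed).
    entropy = 0.5 * m * np.log(2 * np.pi * np.e * sigma_u**2)
    total = 0.0
    for t in range(horizon):
        u = -K @ x + sigma_u * rng.standard_normal(m)
        total += gamma**t * (x @ Q @ x + u @ R @ u - tau * entropy)
        # Multiplicative noise: the dynamics matrices themselves are perturbed.
        A_t = A + sigma_a * rng.standard_normal((n, n))
        B_t = B + sigma_b * rng.standard_normal((n, m))
        x = A_t @ x + B_t @ u
    return total


def zero_order_gradient(K, cost_fn, rng, radius, num_samples):
    """Two-point estimate: average of d/(2r) * (J(K + rU) - J(K - rU)) * U."""
    grad = np.zeros_like(K)
    for _ in range(num_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # random direction on the unit sphere
        # Same seed for both perturbed rollouts (common random numbers).
        seed = int(rng.integers(2**31))
        c_plus = cost_fn(K + radius * U, np.random.default_rng(seed))
        c_minus = cost_fn(K - radius * U, np.random.default_rng(seed))
        grad += (K.size / (2 * radius)) * (c_plus - c_minus) * U
    return grad / num_samples


def average_cost(K, params, seeds):
    """Monte Carlo cost estimate over a fixed set of evaluation seeds."""
    return float(np.mean([rollout_cost(K, np.random.default_rng(s), **params) for s in seeds]))


def run_demo(seed=0, steps=40, lr=0.02, radius=0.1, num_samples=10):
    # Hypothetical scalar system; the learner only sees rollout costs, not A or B.
    params = dict(A=np.array([[0.9]]), B=np.array([[0.5]]), Q=np.eye(1), R=np.eye(1),
                  sigma_a=0.1, sigma_b=0.1, sigma_u=0.1, tau=0.1, gamma=0.9,
                  horizon=30, x0=np.array([1.0]))
    rng = np.random.default_rng(seed)
    K = np.zeros((1, 1))
    cost_fn = lambda K_, r: rollout_cost(K_, r, **params)
    for _ in range(steps):
        K = K - lr * zero_order_gradient(K, cost_fn, rng, radius, num_samples)
    seeds = range(1000, 1200)
    return K, average_cost(np.zeros((1, 1)), params, seeds), average_cost(K, params, seeds)


if __name__ == "__main__":
    K_final, cost_start, cost_end = run_demo()
    print("final gain:", K_final, "cost at K=0:", cost_start, "cost at final K:", cost_end)
```

Reusing one seed for both perturbed rollouts is a variance-reduction choice in this sketch, not something the abstract specifies. A sample-based method in the spirit of the abstract would additionally need step-size and smoothing-radius conditions to support a convergence guarantee.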
Related papers
- Full error analysis of policy gradient learning algorithms for exploratory linear quadratic mean-field control problem in continuous time with common noise [0.0]
We study policy gradient (PG) learning and first demonstrate convergence in a model-based setting.
We prove the global linear convergence and sample complexity of the PG algorithm with two-point gradient estimates in a model-free setting.
In this setting, the parameterized optimal policies are learned from samples of the states and population distribution.
arXiv Detail & Related papers (2024-08-05T14:11:51Z) - Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an RL algorithm to learn the optimal policy parameter directly.
arXiv Detail & Related papers (2024-07-24T12:26:21Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality [55.06411438416805]
Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications. This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies. It is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
arXiv Detail & Related papers (2022-10-03T16:05:43Z) - Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning [9.779769486156631]
Here, the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy.
This exploration-exploitation trade-off is determined by the strength of entropy regularisation.
We prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
arXiv Detail & Related papers (2022-08-08T23:36:40Z) - Finite-time System Identification and Adaptive Control in Autoregressive Exogenous Systems [79.67879934935661]
We study the problem of system identification and adaptive control of unknown ARX systems.
We provide finite-time learning guarantees for the ARX systems under both open-loop and closed-loop data collection.
arXiv Detail & Related papers (2021-08-26T18:00:00Z) - Derivative-Free Policy Optimization for Risk-Sensitive and Robust Control Design: Implicit Regularization and Sample Complexity [15.940861063732608]
Direct policy search serves as one of the workhorses in modern reinforcement learning (RL).
We investigate the convergence theory of policy gradient (PG) methods for the linear risk-sensitive and robust controller.
One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved.
arXiv Detail & Related papers (2021-01-04T16:00:46Z) - Gaussian Process-based Min-norm Stabilizing Controller for Control-Affine Systems with Uncertain Input Effects and Dynamics [90.81186513537777]
We propose a novel compound kernel that captures the control-affine nature of the problem.
We show that this resulting optimization problem is convex, and we call it Gaussian Process-based Control Lyapunov Function Second-Order Cone Program (GP-CLF-SOCP).
arXiv Detail & Related papers (2020-11-14T01:27:32Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Adaptive Control and Regret Minimization in Linear Quadratic Gaussian (LQG) Setting [91.43582419264763]
We propose LqgOpt, a novel reinforcement learning algorithm based on the principle of optimism in the face of uncertainty.
LqgOpt efficiently explores the system dynamics, estimates the model parameters up to their confidence interval, and deploys the controller of the most optimistic model.
arXiv Detail & Related papers (2020-03-12T19:56:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.