Neural Policy Iteration for Stochastic Optimal Control: A Physics-Informed Approach
- URL: http://arxiv.org/abs/2508.01718v1
- Date: Sun, 03 Aug 2025 11:02:25 GMT
- Title: Neural Policy Iteration for Stochastic Optimal Control: A Physics-Informed Approach
- Authors: Yeongjong Kim, Yeoneung Kim, Minseok Kim, Namkyeong Cho
- Abstract summary: We propose a physics-informed neural network policy iteration framework (PINN-PI). At each iteration, a neural network is trained to approximate the value function by minimizing the residual of a linear PDE induced by a fixed policy. We demonstrate the effectiveness of our method on several benchmark problems, including stochastic cartpole, pendulum, and high-dimensional linear quadratic regulation (LQR) problems in up to 10D.
- Score: 2.8988658640181826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a physics-informed neural network policy iteration (PINN-PI) framework for solving stochastic optimal control problems governed by second-order Hamilton--Jacobi--Bellman (HJB) equations. At each iteration, a neural network is trained to approximate the value function by minimizing the residual of a linear PDE induced by a fixed policy. This linear structure enables systematic $L^2$ error control at each policy evaluation step, and allows us to derive explicit Lipschitz-type bounds that quantify how value gradient errors propagate to the policy updates. This interpretability provides a theoretical basis for evaluating policy quality during training. Our method extends recent deterministic PINN-based approaches to stochastic settings, inheriting the global exponential convergence guarantees of classical policy iteration under mild conditions. We demonstrate the effectiveness of our method on several benchmark problems, including stochastic cartpole, pendulum problems and high-dimensional linear quadratic regulation (LQR) problems in up to 10D.
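To make the policy-evaluation step concrete, here is a minimal sketch of one PINN-style evaluation of a fixed policy on a toy 1-D discounted problem: a small network approximates the value function, and its parameters are trained to drive the residual of the policy-induced linear PDE to zero at sampled collocation points. This is an illustrative reconstruction rather than the authors' implementation; the drift f, running cost r, diffusion sigma, discount rate rho, fixed policy, and network architecture are all assumptions made for the example.

```python
# Hedged sketch of one PINN policy-evaluation step (not the authors' code).
# Assumed toy problem: dx = f(x, u) dt + sigma dW, running cost r(x, u),
# discount rate rho, and a fixed linear policy pi(x) to be evaluated.
import jax
import jax.numpy as jnp

rho, sigma = 0.1, 0.2            # discount rate and constant diffusion (assumed)
f = lambda x, u: -x + u          # drift (assumed)
r = lambda x, u: x**2 + u**2     # running cost (assumed)
policy = lambda x: -0.5 * x      # fixed policy pi(x) being evaluated (assumed)

def init_params(key, width=32):
    k1, k2 = jax.random.split(key)
    return {"w1": 0.1 * jax.random.normal(k1, (1, width)), "b1": jnp.zeros(width),
            "w2": 0.1 * jax.random.normal(k2, (width, 1)), "b2": jnp.zeros(1)}

def value(params, x):
    # Small tanh network V_theta(x) for a scalar state x.
    h = jnp.tanh(jnp.reshape(x, (1,)) @ params["w1"] + params["b1"])
    return (h @ params["w2"] + params["b2"])[0]

def pde_residual(params, x):
    # Residual of the linear PDE induced by the fixed policy:
    #   rho * V(x) = r(x, pi(x)) + V'(x) f(x, pi(x)) + 0.5 * sigma^2 * V''(x)
    u = policy(x)
    dV = jax.grad(value, argnums=1)(params, x)
    d2V = jax.grad(jax.grad(value, argnums=1), argnums=1)(params, x)
    return rho * value(params, x) - (r(x, u) + dV * f(x, u) + 0.5 * sigma**2 * d2V)

def loss(params, xs):
    # Mean squared PDE residual over collocation points.
    return jnp.mean(jax.vmap(lambda x: pde_residual(params, x))(xs) ** 2)

key = jax.random.PRNGKey(0)
params = init_params(key)
xs = jax.random.uniform(key, (256,), minval=-2.0, maxval=2.0)  # collocation points
for _ in range(100):  # plain gradient descent; any optimizer would do
    grads = jax.grad(loss)(params, xs)
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
print(float(loss(params, xs)))
```

A full policy-iteration loop would alternate this evaluation step with a policy-improvement step that updates the policy from the gradient of the learned value function, which is where the Lipschitz-type bounds on value-gradient error described in the abstract come into play.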
Related papers
- Solving nonconvex Hamilton--Jacobi--Isaacs equations with PINN-based policy iteration [1.3654846342364308]
We present a framework that combines classical dynamic programming with physics-informed neural networks (PINNs) to solve nonconvex Hamilton--Jacobi--Isaacs equations. Our results suggest that integrating PINNs with policy iteration is a practical and theoretically grounded method for solving high-dimensional, nonconvex HJI equations.
arXiv Detail & Related papers (2025-07-21T10:06:53Z) - Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the agnostic policy learning setting, where the goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z) - Quantile-Optimal Policy Learning under Unmeasured Confounding [55.72891849926314]
We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) the unobserved confounding issue, and (iii) insufficient coverage of the offline dataset.
arXiv Detail & Related papers (2025-06-08T13:37:38Z) - Multilinear Tensor Low-Rank Approximation for Policy-Gradient Methods in Reinforcement Learning [27.868175900131313]
Reinforcement learning (RL) aims to estimate the action to take given a (time-varying) state. This paper postulates multilinear mappings to efficiently estimate the parameters of the RL policy, leveraging the PARAFAC decomposition to design tensor low-rank policies.
arXiv Detail & Related papers (2025-01-08T23:22:08Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Learning Stochastic Parametric Differentiable Predictive Control Policies [2.042924346801313]
We present a scalable alternative called stochastic parametric differentiable predictive control (SP-DPC) for unsupervised learning of neural control policies.
SP-DPC is formulated as a deterministic approximation to the parametric constrained optimal control problem.
We provide theoretical probabilistic guarantees for policies learned via the SP-DPC method on closed-loop constraint and chance constraint satisfaction.
arXiv Detail & Related papers (2022-03-02T22:46:32Z) - Distributional Offline Continuous-Time Reinforcement Learning with Neural Physics-Informed PDEs (SciPhy RL for DOCTR-L) [0.0]
This paper addresses distributional offline continuous-time reinforcement learning (DOCTR-L) with policies for high-dimensional optimal control.
A data-driven solution of the soft HJB equation uses methods of Neural PDEs and Physics-Informed Neural Networks developed in the field of Scientific Machine Learning (SciML).
Our algorithm called Deep DOCTR-L converts offline high-dimensional data into an optimal policy in one step by reducing it to supervised learning.
arXiv Detail & Related papers (2021-04-02T13:22:14Z) - Gaussian Process-based Min-norm Stabilizing Controller for Control-Affine Systems with Uncertain Input Effects and Dynamics [90.81186513537777]
We propose a novel compound kernel that captures the control-affine nature of the problem.
We show that the resulting optimization problem is convex, and we call it the Gaussian Process-based Control Lyapunov Function Second-Order Cone Program (GP-CLF-SOCP).
arXiv Detail & Related papers (2020-11-14T01:27:32Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)