Learning Stabilizing Policies in Stochastic Control Systems
- URL: http://arxiv.org/abs/2205.11991v1
- Date: Tue, 24 May 2022 11:38:22 GMT
- Title: Learning Stabilizing Policies in Stochastic Control Systems
- Authors: Đorđe Žikelić, Mathias Lechner, Krishnendu Chatterjee, Thomas A. Henzinger
- Abstract summary: We study the effectiveness of jointly learning a policy together with a martingale certificate that proves its stability using a single learning algorithm.
Our results suggest that some form of pre-training of the policy is required for the joint optimization to repair and verify the policy successfully.
- Score: 20.045860624444494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we address the problem of learning provably stable neural
network policies for stochastic control systems. While recent work has
demonstrated the feasibility of certifying given policies using martingale
theory, the problem of how to learn such policies is little explored. Here, we
study the effectiveness of jointly learning a policy together with a martingale
certificate that proves its stability using a single learning algorithm. We
observe that the joint optimization problem becomes easily stuck in local
minima when starting from a randomly initialized policy. Our results suggest
that some form of pre-training of the policy is required for the joint
optimization to repair and verify the policy successfully.
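To make the idea of "jointly learning a policy together with a martingale certificate" concrete, below is a minimal illustrative sketch (not the authors' implementation): a toy 1-D stochastic system, a neural policy, and a certificate network trained on sampled states with a ranking-supermartingale-style expected-decrease loss. The dynamics, constants, network sizes, and target set are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Neural policy u = pi(x) and candidate certificate V(x), both tiny MLPs.
# Softplus keeps V nonnegative by construction (an illustrative choice).
policy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
certificate = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1), nn.Softplus())
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(certificate.parameters()), lr=1e-3
)

EPS = 0.1         # required expected decrease of V outside the target set (assumed)
TARGET = 0.2      # |x| <= TARGET is treated as the stabilization target set (assumed)
NOISE_STD = 0.05  # std of the additive disturbance (assumed)
N_NOISE = 8       # Monte-Carlo samples approximating the expectation over the noise

for step in range(2000):
    # States sampled from an assumed region of interest [-2, 2].
    x = torch.rand(256, 1) * 4.0 - 2.0
    u = policy(x)

    # Toy stochastic dynamics: x' = x + 0.1 * u + noise.
    noise = torch.randn(N_NOISE, 256, 1) * NOISE_STD
    x_next = x.unsqueeze(0) + 0.1 * u.unsqueeze(0) + noise

    v = certificate(x)                                                        # V(x)
    v_next = certificate(x_next.reshape(-1, 1)).reshape(N_NOISE, 256, 1).mean(0)  # ~E[V(x')]

    # Supermartingale-style condition: E[V(x')] <= V(x) - EPS outside the target set.
    outside = (x.abs() > TARGET).float()
    decrease_loss = (outside * torch.relu(v_next - v + EPS)).mean()

    optimizer.zero_grad()
    decrease_loss.backward()
    optimizer.step()
```

A low training loss on sampled states is not itself a proof of stability; a separate formal verification step is still needed. The sketch does, however, illustrate the failure mode the abstract describes: when the policy is randomly initialized, this joint loss can easily settle into poor local minima, which is why some form of policy pre-training is suggested before the joint repair-and-verify optimization.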
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, convergent (hyper)policies are learned, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Globally Stable Neural Imitation Policies [3.8772936189143445]
We introduce the Stable Neural Dynamical System (SNDS), an imitation learning regime which produces a policy with formal stability guarantees.
We deploy a neural policy architecture that facilitates the representation of stability based on Lyapunov stability theory.
We successfully deploy the trained policies on a real-world manipulator arm.
arXiv Detail & Related papers (2024-03-07T00:20:11Z) - Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline
Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z) - Learning Provably Stabilizing Neural Controllers for Discrete-Time
Stochastic Systems [18.349820472823055]
We introduce the notion of stabilizing ranking supermartingales (sRSMs).
We show that our learning procedure can successfully learn provably stabilizing policies in practice.
arXiv Detail & Related papers (2022-10-11T09:55:07Z) - A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z) - Understanding Curriculum Learning in Policy Optimization for Online
Combinatorial Optimization [66.35750142827898]
This paper presents the first systematic study on policy optimization methods for online CO problems.
We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds for natural policy gradient (NPG).
Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift.
arXiv Detail & Related papers (2022-02-11T03:17:15Z) - Joint Differentiable Optimization and Verification for Certified
Reinforcement Learning [91.93635157885055]
In model-based reinforcement learning for safety-critical control systems, it is important to formally certify system properties.
We propose a framework that jointly conducts reinforcement learning and formal verification.
arXiv Detail & Related papers (2022-01-28T16:53:56Z) - Cautious Policy Programming: Exploiting KL Regularization in Monotonic
Policy Improvement for Reinforcement Learning [11.82492300303637]
We propose a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning.
We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
arXiv Detail & Related papers (2021-07-13T01:03:10Z) - On Imitation Learning of Linear Control Policies: Enforcing Stability
and Robustness Constraints via LMI Conditions [3.296303220677533]
We formulate the imitation learning of linear policies as a constrained optimization problem.
We show that one can guarantee the closed-loop stability and robustness by posing linear matrix inequality (LMI) constraints on the fitted policy.
arXiv Detail & Related papers (2021-03-24T02:43:03Z) - Closing the Closed-Loop Distribution Shift in Safe Imitation Learning [80.05727171757454]
We treat safe optimization-based control strategies as experts in an imitation learning problem.
We train a learned policy that can be cheaply evaluated at run-time and that provably satisfies the same safety guarantees as the expert.
arXiv Detail & Related papers (2021-02-18T05:11:41Z) - Runtime-Safety-Guided Policy Repair [13.038017178545728]
We study the problem of policy repair for learning-based control policies in safety-critical settings.
We propose to reduce or even eliminate control switching by 'repairing' the trained policy based on runtime data produced by the safety controller.
arXiv Detail & Related papers (2020-08-17T23:31:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.