Stronger Regret Bounds for Safe Online Reinforcement Learning in the Linear Quadratic Regulator
- URL: http://arxiv.org/abs/2410.21081v1
- Date: Mon, 28 Oct 2024 14:46:14 GMT
- Title: Stronger Regret Bounds for Safe Online Reinforcement Learning in the Linear Quadratic Regulator
- Authors: Benjamin Schiffer, Lucas Janson
- Abstract summary: We study Linear Quadratic Regulator (LQR) learning with unknown dynamics.
We show the first $\tilde{O}_T(\sqrt{T})$-regret bound for constrained LQR learning.
An overarching theme of our results is that enforcing safety provides "free exploration".
- Score: 5.108909395876561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many practical applications of online reinforcement learning require the satisfaction of safety constraints while learning about the unknown environment. In this work, we study Linear Quadratic Regulator (LQR) learning with unknown dynamics, but with the additional constraint that the position must stay within a safe region for the entire trajectory with high probability. Unlike in previous works, we allow for both bounded and unbounded noise distributions and study stronger baselines of nonlinear controllers that are better suited for constrained problems than linear controllers. Due to these complications, we focus on 1-dimensional state and action spaces; however, we also discuss how we expect the high-level takeaways to generalize to higher dimensions. Our primary contribution is the first $\tilde{O}_T(\sqrt{T})$-regret bound for constrained LQR learning, which we show relative to a specific baseline of non-linear controllers. We then prove that, for any non-linear baseline satisfying natural assumptions, $\tilde{O}_T(\sqrt{T})$-regret is possible when the noise distribution has sufficiently large support, and $\tilde{O}_T(T^{2/3})$-regret is possible for any subgaussian noise distribution. An overarching theme of our results is that enforcing safety provides "free exploration" that compensates for the added cost of uncertainty in safety-constrained control, resulting in the same regret rate as in the unconstrained problem.
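For orientation, the problem described in the abstract can be summarized with a minimal sketch in standard LQR notation; the symbols below ($a^{*}$, $b^{*}$, $q$, $r$, the safe interval $[-D, D]$, and the baseline class $\Pi$) are illustrative assumptions and may differ from the paper's exact notation.

```latex
% Minimal sketch of the 1-D safety-constrained LQR setup (symbols assumed for
% exposition; they may differ from the paper's notation).
\begin{align*}
  \text{unknown scalar dynamics:}\quad
    & x_{t+1} = a^{*} x_t + b^{*} u_t + w_t \\
  \text{quadratic cost over horizon } T\text{:}\quad
    & J_T(\pi) = \sum_{t=0}^{T} \bigl( q\, x_t^{2} + r\, u_t^{2} \bigr) \\
  \text{safety for the whole trajectory, w.h.p.:}\quad
    & \Pr\bigl( x_t \in [-D, D] \ \ \forall\, t \le T \bigr) \ge 1 - \delta \\
  \text{regret vs.\ a safe non-linear baseline } \pi^{\star} \in \Pi\text{:}\quad
    & \mathrm{Regret}(T) = J_T(\text{alg}) - J_T(\pi^{\star}) = \tilde{O}_T\bigl(\sqrt{T}\bigr)
\end{align*}
```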
Related papers
- Foundations of Safe Online Reinforcement Learning in the Linear Quadratic Regulator: $\sqrt{T}$-Regret [5.108909395876561]
We prove rigorous regret bounds for safety-constrained reinforcement learning.
We provide the first safe algorithm that achieves regret of $\tilde{O}_T(\sqrt{T})$.
arXiv Detail & Related papers (2025-04-25T19:22:57Z) - Regret Bounds for Episodic Risk-Sensitive Linear Quadratic Regulator [5.445357652101423]
Risk-sensitive linear quadratic regulator is one of the most fundamental problems in risk-sensitive optimal control.
We propose a simple least-squares greedy algorithm and show that it achieves $\widetilde{\mathcal{O}}(\log N)$ regret.
This is the first set of regret bounds for episodic risk-sensitive linear quadratic regulator.
arXiv Detail & Related papers (2024-06-08T06:06:20Z) - Adaptive Robust Model Predictive Control via Uncertainty Cancellation [25.736296938185074]
We propose a learning-based robust predictive control algorithm that compensates for significant uncertainty in the dynamics.
We optimize over a class of nonlinear feedback policies inspired by certainty equivalent "estimate-and-cancel" control laws.
arXiv Detail & Related papers (2022-12-02T18:54:23Z) - Learning Control Policies for Stochastic Systems with Reach-avoid
Guarantees [20.045860624444494]
We study the problem of learning controllers for discrete-time non-linear dynamical systems with formal reach-avoid guarantees.
We learn a certificate in the form of a reach-avoid supermartingale (RASM), a novel notion that we introduce in this work.
Our approach solves several important problems -- it can be used to learn a control policy from scratch, to verify a reach-avoid specification for a fixed control policy, or to fine-tune a pre-trained policy.
arXiv Detail & Related papers (2022-10-11T10:02:49Z) - Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement
Learning in Unknown Stochastic Environments [84.3830478851369]
We propose a safe reinforcement learning approach that can jointly learn the environment and optimize the control policy.
Our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safe rate measured via simulations.
arXiv Detail & Related papers (2022-09-29T20:49:25Z) - Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy.
arXiv Detail & Related papers (2022-08-23T05:02:09Z) - Log Barriers for Safe Black-box Optimization with Application to Safe
Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing violation in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z) - First-Order Regret in Reinforcement Learning with Linear Function
Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^{\star} K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least-squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z) - Safe Adaptive Learning-based Control for Constrained Linear Quadratic
Regulators with Regret Guarantees [11.627320138064684]
We study the adaptive control of an unknown linear system with a quadratic cost function subject to safety constraints on both the states and actions.
Our algorithm is implemented on a single trajectory and does not require system restarts.
arXiv Detail & Related papers (2021-10-31T05:52:42Z) - Learning Barrier Certificates: Towards Safe Reinforcement Learning with
Zero Training-time Violations [64.39401322671803]
This paper explores the possibility of safe RL algorithms with zero training-time safety violations.
We propose an algorithm, Co-trained Barrier Certificate for Safe RL (CRABS), which iteratively learns barrier certificates, dynamics models, and policies.
arXiv Detail & Related papers (2021-08-04T04:59:05Z) - On the Stability of Nonlinear Receding Horizon Control: A Geometric
Perspective [72.7951562665449]
The widespread adoption of nonlinear Receding Horizon Control (RHC) strategies by industry has led to more than 30 years of research on their stability.
This paper takes the first step towards understanding the role of global geometry in optimization-based control.
arXiv Detail & Related papers (2021-03-27T22:59:37Z) - Safe Learning of Uncertain Environments for Nonlinear Control-Affine
Systems [10.918870296899245]
We consider the problem of safe learning in nonlinear control-affine systems subject to unknown additive uncertainty.
We model uncertainty as a Gaussian signal and use state measurements to learn its mean and covariance bounds.
We show that with an arbitrarily large probability we can guarantee that the state will remain in the safe set, while learning and control are carried out simultaneously.
arXiv Detail & Related papers (2021-03-02T01:58:02Z) - Closing the Closed-Loop Distribution Shift in Safe Imitation Learning [80.05727171757454]
We treat safe optimization-based control strategies as experts in an imitation learning problem.
We train a learned policy that can be cheaply evaluated at run-time and that provably satisfies the same safety guarantees as the expert.
arXiv Detail & Related papers (2021-02-18T05:11:41Z) - Derivative-Free Policy Optimization for Risk-Sensitive and Robust
Control Design: Implicit Regularization and Sample Complexity [15.940861063732608]
Direct policy search serves as one of the workhorses in modern reinforcement learning (RL).
We investigate the convergence theory of policy gradient (PG) methods for the linear risk-sensitive and robust controller.
One feature of our algorithms is that, during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved.
arXiv Detail & Related papers (2021-01-04T16:00:46Z) - Information Theoretic Regret Bounds for Online Nonlinear Control [35.534829914047336]
We study the problem of sequential control in an unknown, nonlinear dynamical system.
This framework yields a general setting that permits discrete and continuous control inputs as well as non-smooth, non-differentiable dynamics.
We empirically show its application to a number of nonlinear control tasks and demonstrate the benefit of exploration for learning model dynamics.
arXiv Detail & Related papers (2020-06-22T17:46:48Z) - Adaptive Control and Regret Minimization in Linear Quadratic Gaussian
(LQG) Setting [91.43582419264763]
We propose LqgOpt, a novel reinforcement learning algorithm based on the principle of optimism in the face of uncertainty.
LqgOpt efficiently explores the system dynamics, estimates the model parameters up to their confidence interval, and deploys the controller of the most optimistic model.
arXiv Detail & Related papers (2020-03-12T19:56:38Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Regret Minimization in Partially Observable Linear Quadratic Control [91.43582419264763]
We study the problem of regret in partially observable linear quadratic control systems when the model dynamics are unknown a priori.
We propose a novel way to decompose the regret and provide an end-to-end sublinear regret upper bound for partially observable linear quadratic control.
arXiv Detail & Related papers (2020-01-31T22:35:08Z) - Constrained Upper Confidence Reinforcement Learning [12.919486518128734]
This paper extends upper confidence reinforcement learning for settings in which the reward function and the constraints, described by cost functions, are unknown a priori.
We show that the algorithm C-UCRL achieves sub-linear regret with respect to the reward while satisfying the constraints, even while learning, with probability $1-\delta$.
arXiv Detail & Related papers (2020-01-26T00:23:02Z) - Improper Learning for Non-Stochastic Control [78.65807250350755]
We consider the problem of controlling a possibly unknown linear dynamical system with adversarial perturbations, adversarially chosen convex loss functions, and partially observed states.
Applying online descent to this parametrization yields a new controller which attains sublinear regret vs. a large class of closed-loop policies.
Our bounds are the first in the non-stochastic control setting that compete with all stabilizing linear dynamical controllers.
arXiv Detail & Related papers (2020-01-25T02:12:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.