Logarithmic regret for episodic continuous-time linear-quadratic
reinforcement learning over a finite-time horizon
- URL: http://arxiv.org/abs/2006.15316v4
- Date: Fri, 17 Jun 2022 18:48:26 GMT
- Title: Logarithmic regret for episodic continuous-time linear-quadratic
reinforcement learning over a finite-time horizon
- Authors: Matteo Basei, Xin Guo, Anran Hu, Yufei Zhang
- Abstract summary: We study continuous-time linear-quadratic reinforcement learning problems in an episodic setting.
We propose a least-squares algorithm based on continuous-time observations and controls.
- Score: 7.123160883637873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study finite-time horizon continuous-time linear-quadratic reinforcement
learning problems in an episodic setting, where both the state and control
coefficients are unknown to the controller. We first propose a least-squares
algorithm based on continuous-time observations and controls, and establish a
logarithmic regret bound of order $O((\ln M)(\ln\ln M))$, with $M$ being the
number of learning episodes. The analysis consists of two parts: perturbation
analysis, which exploits the regularity and robustness of the associated
Riccati differential equation; and parameter estimation error, which relies on
sub-exponential properties of continuous-time least-squares estimators. We
further propose a practically implementable least-squares algorithm based on
discrete-time observations and piecewise constant controls, which achieves
similar logarithmic regret with an additional term depending explicitly on the
time stepsizes used in the algorithm.
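As a rough illustration of the discrete-time variant's structure (a minimal sketch, not the paper's exact procedure), the loop below alternates least-squares estimation of the unknown coefficients with certainty-equivalence control computed from a Riccati recursion, on an assumed scalar problem with an assumed exploration schedule:

```python
# A minimal certainty-equivalence sketch, assuming scalar dynamics
# dX_t = (A X_t + B u_t) dt + dW_t, Euler discretization with step dt,
# and a decaying exploration schedule -- illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
A_true, B_true = 0.5, 1.0            # unknown coefficients
Q, R, T, dt = 1.0, 1.0, 1.0, 0.01    # running costs, horizon, step size
N = int(T / dt)

def riccati_gains(A, B):
    """Backward Riccati recursion for the Euler-discretized LQ problem."""
    Ad, Bd = 1.0 + A * dt, B * dt
    P, K = Q, np.zeros(N)            # terminal weight P_N = Q (assumption)
    for n in reversed(range(N)):
        K[n] = Bd * P * Ad / (R * dt + Bd ** 2 * P)
        P = Q * dt + Ad * P * (Ad - Bd * K[n])
    return K

Z, y = [], []                        # regressors (x dt, u dt) and increments
A_hat, B_hat = 0.0, 0.0
for m in range(1, 51):               # M = 50 learning episodes
    K = riccati_gains(A_hat, B_hat)
    x = 1.0
    for n in range(N):
        u = -K[n] * x + rng.standard_normal() / np.sqrt(m)  # explore less over time
        dx = (A_true * x + B_true * u) * dt + np.sqrt(dt) * rng.standard_normal()
        Z.append([x * dt, u * dt])
        y.append(dx)
        x += dx
    # Least-squares estimate of (A, B) from all increments observed so far.
    (A_hat, B_hat), *_ = np.linalg.lstsq(np.array(Z), np.array(y), rcond=None)
```

The perturbation-analysis part of the regret bound then controls how the Riccati solution, and hence the played feedback, deviates as the parameter estimates converge.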
Related papers
- Learning Unstable Continuous-Time Stochastic Linear Control Systems [0.0]
We study the problem of system identification for continuous-time dynamics, based on a single finite-length state trajectory.
We present a method for estimating the possibly unstable open-loop matrix by employing properly randomized control inputs.
We establish theoretical performance guarantees showing that the estimation error decays with trajectory length, a measure of excitability, and the signal-to-noise ratio.
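A minimal sketch of this estimation idea, assuming 2-D linear dynamics dX_t = (A X_t + B u_t) dt + dW_t simulated by Euler-Maruyama; the randomized inputs, constants, and dimensions are illustrative, not the paper's exact method:

```python
# Least-squares identification of a possibly unstable open-loop matrix A
# from a single trajectory, using randomized Gaussian inputs for excitation.
# The 2-D dynamics, Euler-Maruyama simulation, and constants are assumptions.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.2, 1.0], [0.0, 0.3]])   # mildly unstable open-loop matrix
B = np.eye(2)
dt, n_steps = 1e-3, 10_000

x = np.zeros(2)
regressors, targets = [], []
for _ in range(n_steps):
    u = rng.standard_normal(2)           # randomized control input
    dx = (A @ x + B @ u) * dt + np.sqrt(dt) * rng.standard_normal(2)
    regressors.append(np.concatenate([x, u]))
    targets.append(dx / dt)              # finite-difference derivative proxy
    x = x + dx

Theta, *_ = np.linalg.lstsq(np.array(regressors), np.array(targets), rcond=None)
A_hat = Theta[:2].T                      # rows of Theta map (x, u) -> dx/dt
print("open-loop estimation error:", np.linalg.norm(A_hat - A))
```

Consistent with the stated guarantee, the error of such an estimator shrinks with trajectory length and with how strongly the randomized inputs excite the system relative to the noise.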
arXiv Detail & Related papers (2024-09-17T16:24:51Z) - Graph Spatiotemporal Process for Multivariate Time Series Anomaly
Detection with Missing Values [67.76168547245237]
We introduce a novel framework called GST-Pro, which utilizes a graph spatiotemporal process and an anomaly scorer to detect anomalies.
Our experimental results show that the GST-Pro method can effectively detect anomalies in time series data and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-01-11T10:10:16Z) - Square-root regret bounds for continuous-time episodic Markov decision
processes [11.585113506994471]
We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting.
We present a learning algorithm based on value iteration and upper confidence bound methods.
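The summary gives no further algorithmic detail, so here is a generic optimistic value-iteration sketch in a discrete-time, finite-horizon tabular stand-in (the continuous-time structure, e.g. random holding times, is omitted; all environment details are assumptions):

```python
# Generic optimistic value iteration in a discrete-time, finite-horizon
# tabular MDP -- a simplified stand-in, not the paper's continuous-time
# algorithm. Environment, bonus shape, and constants are assumptions.
import numpy as np

rng = np.random.default_rng(2)
S, A, H, episodes = 5, 2, 10, 200
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # unknown transition kernel
Rew = rng.uniform(size=(S, A))                   # known rewards, for simplicity

counts = np.full((S, A, S), 1e-6)                # smoothed transition counts
for ep in range(episodes):
    n_sa = counts.sum(axis=2)
    P_hat = counts / n_sa[:, :, None]            # empirical transition model
    bonus = H * np.sqrt(np.log(S * A * episodes) / np.maximum(n_sa, 1.0))
    Q = np.zeros((H + 1, S, A))
    for h in reversed(range(H)):                 # optimistic value iteration
        V_next = np.minimum(Q[h + 1].max(axis=1), H)
        Q[h] = Rew + P_hat @ V_next + bonus
    s = 0                                        # roll out the greedy policy
    for h in range(H):
        a = int(Q[h, s].argmax())
        s_next = rng.choice(S, p=P_true[s, a])
        counts[s, a, s_next] += 1.0
        s = s_next
```

The bonus term shrinks as state-action counts grow, which is what drives the square-root regret in such analyses.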
arXiv Detail & Related papers (2022-10-03T11:35:07Z) - Semi-supervised Learning of Partial Differential Operators and Dynamical
Flows [68.77595310155365]
We present a novel method that combines a hyper-network solver with a Fourier Neural Operator architecture.
We test our method on various time evolution PDEs, including nonlinear fluid flows in one, two, and three spatial dimensions.
The results show that the new method improves learning accuracy at the supervision time points and can interpolate the solutions to any intermediate time.
arXiv Detail & Related papers (2022-07-28T19:59:14Z) - Logarithmic regret bounds for continuous-time average-reward Markov decision processes [9.806527798032809]
We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting.
We derive instance-dependent regret lower bounds that are logarithmic in the time horizon.
arXiv Detail & Related papers (2022-05-23T10:15:00Z) - Online estimation and control with optimal pathlength regret [52.28457815067461]
A natural goal when designing online learning algorithms is to bound the regret of the algorithm in terms of the temporal variation of the input sequence.
Data-dependent "pathlength" regret bounds have recently been obtained for a wide variety of online learning problems, including OCO and bandits.
arXiv Detail & Related papers (2021-10-24T22:43:15Z) - The Connection between Discrete- and Continuous-Time Descriptions of
Gaussian Continuous Processes [60.35125735474386]
We show that discretizations yielding consistent estimators have the property of invariance under coarse-graining.
This result explains why combining differencing schemes for derivative reconstruction with local-in-time inference approaches fails for time series analysis of second- or higher-order differential equations.
arXiv Detail & Related papers (2021-01-16T17:11:02Z) - A Constraint-Based Algorithm for the Structural Learning of
Continuous-Time Bayesian Networks [70.88503833248159]
We propose the first constraint-based algorithm for learning the structure of continuous-time Bayesian networks.
We discuss the different statistical tests and the underlying hypotheses used by our proposal to establish conditional independence.
arXiv Detail & Related papers (2020-07-07T07:34:09Z) - Stochastic Approximation with Markov Noise: Analysis and applications in
reinforcement learning [0.0]
We present, for the first time, a convergence analysis of two time-scale stochastic approximation driven by "controlled" Markov noise.
We analyze the behavior of our framework by relating it to limiting differential inclusions in both time scales.
We obtain the first informative error bounds on function approximation for the policy evaluation algorithm.
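To fix ideas, here is a generic two time-scale sketch with i.i.d. noise and hand-picked mean fields (the paper's controlled Markov noise and differential-inclusion limits are far more general; everything below is an illustrative assumption):

```python
# Generic two time-scale stochastic approximation sketch with i.i.d. noise
# and hand-picked mean fields (illustrative assumptions; the paper treats
# "controlled" Markov noise). Fast iterate w tracks 2*theta; slow iterate
# theta then converges to the fixed point theta = w/4, i.e. theta = 0.
import numpy as np

rng = np.random.default_rng(3)
theta, w = 1.0, 0.0
for n in range(1, 100_000):
    a_n = 1.0 / n                 # slow step size
    b_n = 1.0 / n ** 0.6          # fast step size: b_n / a_n -> infinity
    # Fast time scale: for frozen theta, w converges to 2 * theta.
    w += b_n * (2.0 * theta - w + 0.1 * rng.standard_normal())
    # Slow time scale: theta sees w as already equilibrated at 2 * theta,
    # so its limiting ODE is theta' = -(theta - theta / 2) = -theta / 2.
    theta += a_n * (-(theta - w / 4.0) + 0.1 * rng.standard_normal())
print(theta, w)                   # both drift towards 0
```

The key design point is that b_n / a_n grows without bound, so the fast iterate effectively equilibrates before the slow iterate moves.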
arXiv Detail & Related papers (2020-04-08T03:59:21Z) - Efficient improper learning for online logistic regression [68.8204255655161]
It is known that any proper algorithm achieving logarithmic regret in the number of samples (denoted n) necessarily suffers an exponential multiplicative constant in B, the bound on the norm of the comparator.
In this work, we design an efficient improper algorithm that avoids this exponential constant while preserving a logarithmic regret.
Our new algorithm, based on regularized empirical risk minimization with surrogate losses, satisfies a regret scaling as O(B log(Bn)) with a per-round time complexity of order O(d^2).
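The improper algorithm itself is not reproduced here; as a point of reference for the online protocol, below is a simple proper baseline, projected online gradient descent on the logistic loss over the B-ball (which attains only O(sqrt(n)) regret, while proper methods with logarithmic regret pay the exp(B) constant noted above). All data-generation details are assumptions:

```python
# Proper baseline for reference only: projected online gradient descent on
# the logistic loss over the ball of radius B. Data generation and step
# sizes are assumptions; the paper's improper algorithm is different.
import numpy as np

rng = np.random.default_rng(4)
d, n, B = 5, 1000, 1.0
w_star = rng.standard_normal(d)
w_star *= B / np.linalg.norm(w_star)              # ground-truth comparator
w = np.zeros(d)
total_loss = 0.0
for t in range(1, n + 1):
    x = rng.standard_normal(d) / np.sqrt(d)       # bounded-ish features
    y = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-w_star @ x)) else -1.0
    total_loss += np.log1p(np.exp(-y * (w @ x)))  # logistic loss suffered
    grad = -y * x / (1.0 + np.exp(y * (w @ x)))   # gradient of that loss
    w -= grad / np.sqrt(t)                        # OGD step, eta_t = 1/sqrt(t)
    if np.linalg.norm(w) > B:                     # project back onto B-ball
        w *= B / np.linalg.norm(w)
```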
arXiv Detail & Related papers (2020-03-18T09:16:14Z) - Convergence and sample complexity of gradient methods for the model-free
linear quadratic regulator problem [27.09339991866556]
We study gradient methods that search for an optimal control of an unknown dynamical system by directly searching over the corresponding space of controllers.
We take a step towards demystifying the performance and efficiency of such methods by focusing on the gradient-flow dynamics over the set of stabilizing feedback gains; a similar result holds for the forward discretization of the ODE.
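As a toy illustration of direct search over controllers (not the paper's analysis), the sketch below runs gradient descent on the scalar discrete-time LQR cost J(K), estimating derivatives purely from simulated rollouts via random-direction finite differences; the dynamics, constants, and discrete-time setting are all assumptions:

```python
# Toy model-free LQR: gradient descent over the scalar feedback gain K,
# with derivatives of the rollout cost J(K) estimated by random-direction
# finite differences. Dynamics and constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
A, B, Q, R = 1.1, 1.0, 1.0, 1.0      # live only inside the simulator

def rollout_cost(K, T=200, x0=1.0):
    """Finite-horizon cost of the feedback u = -K x from a fixed start."""
    x, total = x0, 0.0
    for _ in range(T):
        u = -K * x
        total += Q * x ** 2 + R * u ** 2
        x = A * x + B * u
    return total

K, r, eta = 1.0, 0.05, 0.05          # stabilizing init, smoothing radius, step
for _ in range(200):
    u = rng.choice([-1.0, 1.0])      # random search direction (1-D "sphere")
    grad_hat = (rollout_cost(K + r * u) - rollout_cost(K - r * u)) * u / (2 * r)
    K -= eta * grad_hat              # descend the estimated gradient
print("learned gain:", K)            # approaches the LQR-optimal gain (~0.70)
```

In the model-free setting the learner only queries rollout costs; the system matrices appear solely inside the simulator.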
arXiv Detail & Related papers (2019-12-26T16:56:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.