Global Convergence of Receding-Horizon Policy Search in Learning
Estimator Designs
- URL: http://arxiv.org/abs/2309.04831v1
- Date: Sat, 9 Sep 2023 16:03:49 GMT
- Title: Global Convergence of Receding-Horizon Policy Search in Learning
Estimator Designs
- Authors: Xiangyuan Zhang, Saviz Mowlavi, Mouhacine Benosman, Tamer Başar
- Abstract summary: We introduce the receding-horizon policy gradient (RHPG) algorithm.
RHPG is the first PG algorithm with provable global convergence in learning the optimal linear estimator design.
- Score: 3.0811185425377743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the receding-horizon policy gradient (RHPG) algorithm, the first
PG algorithm with provable global convergence in learning the optimal linear
estimator designs, i.e., the Kalman filter (KF). Notably, the RHPG algorithm
does not require any prior knowledge of the system for initialization and does
not require the target system to be open-loop stable. The key idea of RHPG is
to integrate vanilla PG (or any other policy search direction) into a dynamic
programming outer loop, which iteratively decomposes the infinite-horizon KF
problem that is constrained and non-convex in the policy parameter into a
sequence of static estimation problems that are unconstrained and
strongly-convex, thus enabling global convergence. We further provide
fine-grained analyses of the optimization landscape under RHPG and detail the
convergence and sample complexity guarantees of the algorithm. This work serves
as an initial attempt to develop reinforcement learning algorithms specifically
for control applications with performance guarantees by utilizing classic
control theory in both algorithmic design and theoretical analyses. Lastly, we
validate our theories by deploying the RHPG algorithm to learn the Kalman
filter design of a large-scale convection-diffusion model. We open-source the
code repository at \url{https://github.com/xiangyuan-zhang/LearningKF}.
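For intuition, below is a minimal NumPy sketch of the receding-horizon decomposition described in the abstract. It is not the authors' implementation (see the linked repository for that): the system matrices, step size, horizon, and iteration counts are illustrative assumptions, and the inner loop uses exact model-based gradients where the paper's model-free setting would use sampled policy-gradient estimates.

import numpy as np

# Assumed linear system: x_{t+1} = A x_t + w_t,  y_t = C x_t + v_t
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # illustrative dynamics
C = np.array([[1.0, 0.0]])               # illustrative observation map
Q = 0.1 * np.eye(2)                      # process-noise covariance
R = 0.5 * np.eye(1)                      # measurement-noise covariance

def static_cost_grad(L, Sigma):
    # Gradient of the one-step MSE J(L) = tr((I - L C) Sigma (I - L C)^T + L R L^T),
    # which is unconstrained and strongly convex in the gain L.
    I = np.eye(A.shape[0])
    return 2.0 * ((L @ C - I) @ Sigma @ C.T + L @ R)

def rhpg_kalman_gain(horizon=50, inner_iters=500, lr=0.05):
    # Outer dynamic-programming loop over the horizon; each step solves a
    # static estimation problem for the gain by vanilla gradient descent.
    n, m = A.shape[0], C.shape[0]
    Sigma = np.eye(n)        # initial prediction-error covariance (assumed)
    L = np.zeros((n, m))     # no prior knowledge needed for initialization
    for _ in range(horizon):
        for _ in range(inner_iters):
            L -= lr * static_cost_grad(L, Sigma)
        # Propagate the prediction-error covariance using the learned gain.
        I = np.eye(n)
        P = (I - L @ C) @ Sigma @ (I - L @ C).T + L @ R @ L.T
        Sigma = A @ P @ A.T + Q
    return L

print(rhpg_kalman_gain())    # approaches the steady-state Kalman gain

Because each inner subproblem is strongly convex, plain gradient descent finds its unique minimizer, and the outer loop then reproduces the Riccati-style covariance propagation, so the learned gain converges toward the steady-state Kalman gain of the assumed system.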
Related papers
- Full error analysis of policy gradient learning algorithms for exploratory linear quadratic mean-field control problem in continuous time with common noise [0.0]
We study policy gradient (PG) learning and first demonstrate convergence in a model-based setting.
We prove the global linear convergence and sample complexity of the PG algorithm with two-point gradient estimates in a model-free setting.
In this setting, the parameterized optimal policies are learned from samples of the states and population distribution.
arXiv Detail & Related papers (2024-08-05T14:11:51Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - e-COP : Episodic Constrained Optimization of Policies [12.854752753529151]
We present the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings.
We show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting.
arXiv Detail & Related papers (2024-06-13T20:12:09Z) - Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs [9.58750210024265]
We consider (stochastic) softmax policy gradient (PG) methods for bandits and Markov decision processes (MDPs).
We show that the proposed algorithm offers theoretical guarantees similar to the state-of-the-art results, but does not require knowledge of oracle-like quantities.
For the multi-armed bandit setting, our techniques result in a theoretically-principled PG algorithm that does not require explicit exploration, the knowledge of the reward gap, the reward distributions, or the noise.
arXiv Detail & Related papers (2024-05-21T18:12:39Z) - GQFedWAvg: Optimization-Based Quantized Federated Learning in General
Edge Computing Systems [11.177402054314674]
The optimal implementation of federated learning (FL) in practical edge computing has been an outstanding problem.
We propose an optimization-based quantized FL algorithm that can appropriately fit a general edge computing system with uniform or nonuniform computing and communication resources.
arXiv Detail & Related papers (2023-06-13T02:18:24Z) - Revisiting LQR Control from the Perspective of Receding-Horizon Policy
Gradient [2.1756081703276]
We revisit the discrete-time linear quadratic regulator (LQR) problem from the perspective of receding-horizon policy gradient (RHPG)
We provide a fine-grained sample complexity analysis for RHPG to learn a control policy that is both stabilizing and $\epsilon$-close to the optimal LQR solution.
arXiv Detail & Related papers (2023-02-25T19:16:40Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems, yet the theoretical understanding of "vanilla" PG remains limited.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased estimates of the gradient of the value function, which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and the number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)