FORESEE: Prediction with Expansion-Compression Unscented Transform for
Online Policy Optimization
- URL: http://arxiv.org/abs/2209.12644v2
- Date: Thu, 1 Feb 2024 02:14:20 GMT
- Title: FORESEE: Prediction with Expansion-Compression Unscented Transform for
Online Policy Optimization
- Authors: Hardik Parwana and Dimitra Panagou
- Abstract summary: We introduce a method for state prediction, called the Expansion-Compression Unscented Transform, to solve a class of online policy optimization problems.
Our proposed algorithm propagates a finite number of sigma points through a state-dependent distribution, which dictates an increase in the number of sigma points at each time step.
Its performance is empirically shown to be comparable to Monte Carlo but at a much lower computational cost.
- Score: 8.97438370260135
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Propagating state distributions through a generic, uncertain nonlinear
dynamical model is known to be intractable and usually begets numerical or
analytical approximations. We introduce a method for state prediction, called
the Expansion-Compression Unscented Transform, and use it to solve a class of
online policy optimization problems. Our proposed algorithm propagates a finite
number of sigma points through a state-dependent distribution, which dictates
an increase in the number of sigma points at each time step to represent the
resulting distribution; this is what we call the expansion operation. To keep
the algorithm scalable, we augment the expansion operation with a compression
operation based on moment matching, thereby keeping the number of sigma points
constant across predictions over multiple time steps. Its performance is
empirically shown to be comparable to Monte Carlo but at a much lower
computational cost. Under state and control input constraints, the state
prediction is subsequently used in tandem with a proposed variant of
constrained gradient-descent for online update of policy parameters in a
receding horizon fashion. The framework is implemented as a differentiable
computational graph for policy training. We showcase our framework for a
quadrotor stabilization task as part of a benchmark comparison in
safe-control-gym and for optimizing the parameters of a Control Barrier
Function based controller in a leader-follower problem.
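To make the expansion-compression idea concrete, below is a minimal numerical sketch in Python/NumPy. It assumes a generic discrete-time model x' = f(x) + w with state-dependent Gaussian noise w ~ N(0, Σ(x)); the function names, the particular unscented weights, and the toy dynamics in the usage lines are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigma_points(mu, cov, kappa=1.0):
    """Generate 2n+1 unscented-transform sigma points and weights for N(mu, cov)."""
    n = mu.size
    L = np.linalg.cholesky((n + kappa) * cov)
    pts = [mu] + [mu + L[:, i] for i in range(n)] + [mu - L[:, i] for i in range(n)]
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return np.array(pts), w

def moment_match(pts, w):
    """Compress a weighted point set to its mean and covariance (first two moments)."""
    mu = w @ pts
    d = pts - mu
    return mu, (w[:, None] * d).T @ d

def ec_ut_step(pts, w, f, noise_cov, kappa=1.0):
    """One prediction step through x' = f(x) + w with state-dependent w ~ N(0, noise_cov(x))."""
    child_pts, child_w = [], []
    for p, wi in zip(pts, w):
        # Expansion: every sigma point spawns sigma points of its own noise distribution.
        cp, cw = sigma_points(f(p), noise_cov(p), kappa)
        child_pts.append(cp)
        child_w.append(wi * cw)
    # Compression: moment-match the expanded cloud and regenerate a fixed-size set.
    mu, cov = moment_match(np.vstack(child_pts), np.concatenate(child_w))
    return sigma_points(mu, cov, kappa)

# Toy usage (illustrative dynamics, not from the paper): a damped scalar pair
# whose noise grows with the first state component.
f = lambda x: np.array([x[0] + 0.1 * x[1], 0.9 * x[1]])
noise_cov = lambda x: 0.01 * (1.0 + x[0] ** 2) * np.eye(2)
pts, w = sigma_points(np.zeros(2), 0.1 * np.eye(2))
for _ in range(10):
    pts, w = ec_ut_step(pts, w, f, noise_cov)
print(moment_match(pts, w))  # predicted mean and covariance after 10 steps
```

Because each step ends with a fixed-size sigma-point set, the per-step cost stays constant over the horizon, which is what makes the prediction usable inside a differentiable receding-horizon policy update.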
Related papers
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic
Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
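As a rough illustration of how such an optimization-free, tree-based nearest-neighbor estimator can be implemented, here is a hedged Python sketch; stitching logged transitions into rollouts and all names below are simplifying assumptions, not the paper's exact resampling scheme.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_ope_estimate(states, actions, rewards, next_states, policy,
                     start_states, horizon, gamma=0.99, n_rollouts=100, seed=0):
    """Schematic KNN-based OPE: roll out episodes for the target policy by
    reusing the logged transition nearest to each queried (state, action) pair.
    States and actions are assumed to be continuous vectors."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(np.hstack([states, actions]))   # tree-based nearest-neighbor index
    returns = []
    for _ in range(n_rollouts):
        s = start_states[rng.integers(len(start_states))]
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                           # target policy's action
            _, idx = tree.query(np.concatenate([s, a]))
            ret += disc * rewards[idx]              # reuse the logged reward
            disc *= gamma
            s = next_states[idx]                    # continue from the logged next state
        returns.append(ret)
    return float(np.mean(returns))
```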
arXiv Detail & Related papers (2023-06-07T23:55:12Z) - Convergence of policy gradient methods for finite-horizon exploratory
linear-quadratic control problems [3.8661825615213012]
We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems.
We propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis and achieves robust linear convergence across different action frequencies.
arXiv Detail & Related papers (2022-11-01T17:31:41Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Structural Estimation of Markov Decision Processes in High-Dimensional
State Space with Finite-Time Guarantees [39.287388288477096]
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states.
This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized.
We propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces.
arXiv Detail & Related papers (2022-10-04T00:11:38Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
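A hedged sketch of the one-sample idea, simplified to a smooth finite-sum objective over the probability simplex (the paper's composite, non-smooth setting is not reproduced here); the step size and the SAG bookkeeping below are illustrative.

```python
import numpy as np

def sag_frank_wolfe(grad_i, n_samples, dim, n_iters=1000, seed=0):
    """One-sample conditional gradient (Frank-Wolfe) over the probability simplex
    using a stochastic average gradient (SAG) estimator.
    grad_i(i, x) returns the gradient of the i-th summand at x."""
    rng = np.random.default_rng(seed)
    x = np.full(dim, 1.0 / dim)                 # start at the simplex centre
    grad_table = np.zeros((n_samples, dim))     # last seen gradient of each summand
    grad_avg = np.zeros(dim)
    for t in range(1, n_iters + 1):
        i = rng.integers(n_samples)             # one sample per iteration
        g_new = grad_i(i, x)
        grad_avg += (g_new - grad_table[i]) / n_samples
        grad_table[i] = g_new
        # Linear minimization oracle over the simplex: pick the best vertex.
        v = np.zeros(dim)
        v[np.argmin(grad_avg)] = 1.0
        x += (2.0 / (t + 2.0)) * (v - x)        # classic Frank-Wolfe step size
    return x

# Example: minimize (1/n) * sum_i (a_i @ x - b_i)^2 over the simplex.
A = np.random.default_rng(0).normal(size=(50, 5))
x_hat = sag_frank_wolfe(lambda i, x: 2.0 * (A[i] @ x) * A[i], 50, 5)
```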
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is to establish that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe that IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - A Convergent and Dimension-Independent Min-Max Optimization Algorithm [32.492526162436405]
The min-player uses a proposal distribution to choose directions in which to update its parameters.
Given a smooth and bounded nonconvex-nonconcave objective, our algorithm converges to an approximate local equilibrium in a number of iterations that does not depend on the dimension.
arXiv Detail & Related papers (2020-06-22T16:11:30Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)