FORESEE: Prediction with Expansion-Compression Unscented Transform for
Online Policy Optimization
- URL: http://arxiv.org/abs/2209.12644v2
- Date: Thu, 1 Feb 2024 02:14:20 GMT
- Title: FORESEE: Prediction with Expansion-Compression Unscented Transform for
Online Policy Optimization
- Authors: Hardik Parwana and Dimitra Panagou
- Abstract summary: We introduce a method for state prediction, called the Expansion-Compression Unscented Transform, to solve a class of online policy optimization problems.
Our proposed algorithm propagates a finite number of sigma points through a state-dependent distribution, which dictates an increase in the number of sigma points at each time step.
Its performance is empirically shown to be comparable to Monte Carlo but at a much lower computational cost.
- Score: 8.97438370260135
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Propagating state distributions through a generic, uncertain nonlinear
dynamical model is known to be intractable and usually begets numerical or
analytical approximations. We introduce a method for state prediction, called
the Expansion-Compression Unscented Transform, and use it to solve a class of
online policy optimization problems. Our proposed algorithm propagates a finite
number of sigma points through a state-dependent distribution, which dictates
an increase in the number of sigma points at each time step to represent the
resulting distribution; this is what we call the expansion operation. To keep
the algorithm scalable, we augment the expansion operation with a compression
operation based on moment matching, thereby keeping the number of sigma points
constant across predictions over multiple time steps. Its performance is
empirically shown to be comparable to Monte Carlo but at a much lower
computational cost. Under state and control input constraints, the state
prediction is subsequently used in tandem with a proposed variant of
constrained gradient-descent for online update of policy parameters in a
receding horizon fashion. The framework is implemented as a differentiable
computational graph for policy training. We showcase our framework for a
quadrotor stabilization task as part of a benchmark comparison in
safe-control-gym and for optimizing the parameters of a Control Barrier
Function based controller in a leader-follower problem.
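To make the expansion-compression idea concrete, below is a minimal numerical sketch in Python/NumPy. It assumes a generic discrete-time model x' = f(x) + w with state-dependent Gaussian noise w ~ N(0, Σ(x)); the function names, the particular unscented weights, and the toy dynamics in the usage lines are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigma_points(mu, cov, kappa=1.0):
    """Generate 2n+1 unscented-transform sigma points and weights for N(mu, cov)."""
    n = mu.size
    L = np.linalg.cholesky((n + kappa) * cov)
    pts = [mu] + [mu + L[:, i] for i in range(n)] + [mu - L[:, i] for i in range(n)]
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return np.array(pts), w

def moment_match(pts, w):
    """Compress a weighted point set to its mean and covariance (first two moments)."""
    mu = w @ pts
    d = pts - mu
    return mu, (w[:, None] * d).T @ d

def ec_ut_step(pts, w, f, noise_cov, kappa=1.0):
    """One prediction step through x' = f(x) + w with state-dependent w ~ N(0, noise_cov(x))."""
    child_pts, child_w = [], []
    for p, wi in zip(pts, w):
        # Expansion: every sigma point spawns sigma points of its own noise distribution.
        cp, cw = sigma_points(f(p), noise_cov(p), kappa)
        child_pts.append(cp)
        child_w.append(wi * cw)
    # Compression: moment-match the expanded cloud and regenerate a fixed-size set.
    mu, cov = moment_match(np.vstack(child_pts), np.concatenate(child_w))
    return sigma_points(mu, cov, kappa)

# Toy usage (illustrative dynamics, not from the paper): a damped scalar pair
# whose noise grows with the first state component.
f = lambda x: np.array([x[0] + 0.1 * x[1], 0.9 * x[1]])
noise_cov = lambda x: 0.01 * (1.0 + x[0] ** 2) * np.eye(2)
pts, w = sigma_points(np.zeros(2), 0.1 * np.eye(2))
for _ in range(10):
    pts, w = ec_ut_step(pts, w, f, noise_cov)
print(moment_match(pts, w))  # predicted mean and covariance after 10 steps
```

Because each step ends with a fixed-size sigma-point set, the per-step cost stays constant over the horizon, which is what makes the prediction usable inside a differentiable receding-horizon policy update.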
Related papers
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic
Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
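As a rough illustration of how such an optimization-free, tree-based nearest-neighbor estimator can be implemented, here is a hedged Python sketch; stitching logged transitions into rollouts and all names below are simplifying assumptions, not the paper's exact resampling scheme.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_ope_estimate(states, actions, rewards, next_states, policy,
                     start_states, horizon, gamma=0.99, n_rollouts=100, seed=0):
    """Schematic KNN-based OPE: roll out episodes for the target policy by
    reusing the logged transition nearest to each queried (state, action) pair.
    States and actions are assumed to be continuous vectors."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(np.hstack([states, actions]))   # tree-based nearest-neighbor index
    returns = []
    for _ in range(n_rollouts):
        s = start_states[rng.integers(len(start_states))]
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                           # target policy's action
            _, idx = tree.query(np.concatenate([s, a]))
            ret += disc * rewards[idx]              # reuse the logged reward
            disc *= gamma
            s = next_states[idx]                    # continue from the logged next state
        returns.append(ret)
    return float(np.mean(returns))
```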
arXiv Detail & Related papers (2023-06-07T23:55:12Z) - Convergence of policy gradient methods for finite-horizon exploratory
linear-quadratic control problems [3.8661825615213012]
We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems.
We propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis and achieves robust linear convergence across different action frequencies.
arXiv Detail & Related papers (2022-11-01T17:31:41Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Structural Estimation of Markov Decision Processes in High-Dimensional
State Space with Finite-Time Guarantees [39.287388288477096]
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states.
This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized.
We propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces.
arXiv Detail & Related papers (2022-10-04T00:11:38Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
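A hedged sketch of the one-sample idea, simplified to a smooth finite-sum objective over the probability simplex (the paper's composite, non-smooth setting is not reproduced here); the step size and the SAG bookkeeping below are illustrative.

```python
import numpy as np

def sag_frank_wolfe(grad_i, n_samples, dim, n_iters=1000, seed=0):
    """One-sample conditional gradient (Frank-Wolfe) over the probability simplex
    using a stochastic average gradient (SAG) estimator.
    grad_i(i, x) returns the gradient of the i-th summand at x."""
    rng = np.random.default_rng(seed)
    x = np.full(dim, 1.0 / dim)                 # start at the simplex centre
    grad_table = np.zeros((n_samples, dim))     # last seen gradient of each summand
    grad_avg = np.zeros(dim)
    for t in range(1, n_iters + 1):
        i = rng.integers(n_samples)             # one sample per iteration
        g_new = grad_i(i, x)
        grad_avg += (g_new - grad_table[i]) / n_samples
        grad_table[i] = g_new
        # Linear minimization oracle over the simplex: pick the best vertex.
        v = np.zeros(dim)
        v[np.argmin(grad_avg)] = 1.0
        x += (2.0 / (t + 2.0)) * (v - x)        # classic Frank-Wolfe step size
    return x

# Example: minimize (1/n) * sum_i (a_i @ x - b_i)^2 over the simplex.
A = np.random.default_rng(0).normal(size=(50, 5))
x_hat = sag_frank_wolfe(lambda i, x: 2.0 * (A[i] @ x) * A[i], 50, 5)
```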
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is to establish that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe that IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - A Convergent and Dimension-Independent Min-Max Optimization Algorithm [32.492526162436405]
The min-player uses a proposal distribution to choose directions in which to update its parameters.
Given a smooth and bounded nonconvex-nonconcave objective, our algorithm converges to an approximate local equilibrium in a number of iterations that does not depend on the dimension.
arXiv Detail & Related papers (2020-06-22T16:11:30Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)