Trusted Approximate Policy Iteration with Bisimulation Metrics
- URL: http://arxiv.org/abs/2202.02881v1
- Date: Sun, 6 Feb 2022 22:41:56 GMT
- Title: Trusted Approximate Policy Iteration with Bisimulation Metrics
- Authors: Mete Kemertas, Allan Jepson
- Abstract summary: Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences.
In this work we first prove that bisimulation metrics can be defined via any $p$-Wasserstein metric for $p \geq 1$.
We then describe an approximate policy iteration (API) procedure that uses $\epsilon$-aggregation with $\pi$-bisimulation and prove performance bounds for continuous state spaces.
- Score: 1.6498361958317633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bisimulation metrics define a distance measure between states of a Markov
decision process (MDP) based on a comparison of reward sequences. Due to this
property they provide theoretical guarantees in value function approximation.
In this work we first prove that bisimulation metrics can be defined via any
$p$-Wasserstein metric for $p\geq 1$. Then we describe an approximate policy
iteration (API) procedure that uses $\epsilon$-aggregation with
$\pi$-bisimulation and prove performance bounds for continuous state spaces. We
bound the difference between $\pi$-bisimulation metrics in terms of the change
in the policies themselves. Based on these theoretical results, we design an
API($\alpha$) procedure that employs conservative policy updates and enjoys
better performance bounds than the naive API approach. In addition, we propose
a novel trust region approach which circumvents the requirement to explicitly
solve a constrained optimization problem. Finally, we provide experimental
evidence of improved stability compared to non-conservative alternatives in
simulated continuous control.
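To make the fixed-point definition concrete, here is a minimal sketch (not the authors' code) of computing a $\pi$-bisimulation metric on a small finite MDP with $p = 1$: iterate $d(s, s') \leftarrow |R(s) - R(s')| + \gamma W_1(P(s, \cdot), P(s', \cdot); d)$, where `P` and `R` are hypothetical policy-induced transitions and rewards, and each $W_1$ is solved as a small transport LP.

```python
import numpy as np
from scipy.optimize import linprog

def w1(mu, nu, D):
    """1-Wasserstein distance between discrete distributions mu and nu
    under ground-cost matrix D, solved as an optimal-transport LP."""
    n, m = len(mu), len(nu)
    A_eq, b_eq = [], []
    for i in range(n):                      # row marginals: sum_j T[i,j] = mu[i]
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(mu[i])
    for j in range(m):                      # column marginals: sum_i T[i,j] = nu[j]
        col = np.zeros(n * m); col[j::m] = 1.0
        A_eq.append(col); b_eq.append(nu[j])
    res = linprog(D.reshape(-1), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

def pi_bisimulation(P, R, gamma=0.9, iters=100, tol=1e-6):
    """Fixed-point iteration for the pi-bisimulation metric with p = 1.
    P: (S, S) policy-induced transition matrix, R: (S,) policy-induced
    rewards (hypothetical inputs). The update is a gamma-contraction."""
    S = len(R)
    d = np.zeros((S, S))
    for _ in range(iters):
        d_new = np.zeros_like(d)
        for s in range(S):
            for t in range(s + 1, S):
                d_new[s, t] = abs(R[s] - R[t]) + gamma * w1(P[s], P[t], d)
                d_new[t, s] = d_new[s, t]
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d
```

The paper's point is that the same construction goes through for any $p \geq 1$, with $W_1$ replaced by $W_p$.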
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z) - $\Delta\text{-}\mathrm{OPE}$: Off-Policy Estimation with Pairs of Policies [13.528097424046823]
We introduce $\Delta\text{-}\mathrm{OPE}$ methods based on the widely used Inverse Propensity Scoring estimator.
Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
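As a rough illustration of the idea (a one-step, bandit-style sketch under the assumption that the estimator differences the two policies' IPS terms over shared logged data; not the paper's exact estimators), see below; all names are hypothetical.

```python
import numpy as np

def delta_ips(rewards, logp_behavior, logp_a, logp_b):
    """Sketch of a Delta-OPE-style estimate of V(pi_a) - V(pi_b) via
    Inverse Propensity Scoring over one logged dataset. Inputs are the
    logged rewards and log-probabilities of the logged actions under the
    behavior policy and the two target policies (hypothetical arrays)."""
    w_a = np.exp(logp_a - logp_behavior)    # importance weights for pi_a
    w_b = np.exp(logp_b - logp_behavior)    # importance weights for pi_b
    # Differencing weights before averaging cancels shared randomness,
    # which can reduce variance relative to two independent IPS estimates.
    return float(np.mean((w_a - w_b) * rewards))
```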
arXiv Detail & Related papers (2024-05-16T12:04:55Z) - Online non-parametric likelihood-ratio estimation by Pearson-divergence
functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
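A minimal sketch of such an online update, assuming a linear-in-features ratio model $r(x) = w^\top \phi(x)$ fit by streaming minimization of the Pearson-divergence (least-squares) objective $\frac{1}{2}\mathbb{E}_q[r(x)^2] - \mathbb{E}_p[r(x)]$; `stream` and `feats` are hypothetical placeholders and the paper's kernelized estimator differs in detail:

```python
import numpy as np

def olre_pearson(stream, feats, dim, lr=0.05):
    """Online likelihood-ratio estimation sketch: one SGD step on the
    least-squares (Pearson) objective per observed pair (x_t ~ p, x'_t ~ q)."""
    w = np.zeros(dim)
    for x_p, x_q in stream:
        f_p, f_q = feats(x_p), feats(x_q)
        # Gradient of 0.5 * (w @ f_q)**2 - (w @ f_p) with respect to w
        w -= lr * ((w @ f_q) * f_q - f_p)
    return w          # estimated ratio: r(x) = w @ feats(x)
```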
arXiv Detail & Related papers (2023-11-03T13:20:11Z) - Bridging the Gap between Newton-Raphson Method and Regularized Policy
Iteration [13.166738075816493]
We show that regularized policy iteration is strictly equivalent to the standard Newton-Raphson method when the Bellman equation is smoothed with strongly convex functions.
We prove that regularized policy iteration has global linear convergence with rate $\gamma$ (the discount factor).
We also show that a modified version of regularized policy iteration, i.e., with finite-step policy evaluation, is equivalent to inexact Newton method where the Newton iteration formula is solved with truncated iterations.
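To see the objects involved on a concrete case, here is a sketch of entropy-regularized policy iteration on a finite MDP (entropy is one choice of strongly convex smoother; `P` and `R` are hypothetical inputs, and the Newton-step interpretation is the paper's result, not visible in the code itself):

```python
import numpy as np
from scipy.special import softmax

def soft_policy_iteration(P, R, gamma=0.9, tau=0.1, iters=100, tol=1e-10):
    """Entropy-regularized policy iteration. P: (A, S, S) transition
    tensor, R: (S, A) rewards. Each outer iteration corresponds, per the
    paper's equivalence, to a Newton-Raphson step on the smoothed Bellman
    equation V(s) = tau * logsumexp_a((R(s,a) + gamma * E_P[V]) / tau)."""
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * np.einsum("asz,z->sa", P, V)
        pi = softmax(Q / tau, axis=1)             # soft policy improvement
        # Exact regularized evaluation: solve (I - gamma * P_pi) V = r_pi
        P_pi = np.einsum("sa,asz->sz", pi, P)
        r_pi = np.sum(pi * R, axis=1) - tau * np.sum(pi * np.log(pi + 1e-12), axis=1)
        V_new = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, pi
        V = V_new
    return V, pi
```

Replacing the exact linear solve with a fixed number of evaluation sweeps gives the finite-step variant that the paper matches to an inexact (truncated) Newton method.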
arXiv Detail & Related papers (2023-10-11T05:55:20Z) - $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic
Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
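A hedged sketch of the resampling loop (not the paper's exact procedure or guarantees): roll out the target policy by matching each proposed state-action pair to its $K$ nearest logged transitions and resampling reward and next state from them; every input below is a hypothetical placeholder.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_rollout_value(logged_sa, logged_r, logged_next_s, policy, s0,
                      horizon=50, gamma=0.99, k=5, seed=0):
    """Estimate a policy's discounted return by K-nearest-neighbor
    resampling from logged (state, action, reward, next state) tuples."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(logged_sa)                 # index logged (state, action) pairs
    s, total = np.asarray(s0, dtype=float), 0.0
    for t in range(horizon):
        sa = np.concatenate([s, np.atleast_1d(policy(s))])
        _, idx = tree.query(sa, k=k)          # k closest logged transitions
        j = int(rng.choice(np.atleast_1d(idx)))  # resample one neighbor
        total += (gamma ** t) * logged_r[j]
        s = np.asarray(logged_next_s[j], dtype=float)
    return total
```

Averaging this estimate over many resampled episodes mirrors the episode-level sampling the analysis allows for.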
arXiv Detail & Related papers (2023-06-07T23:55:12Z) - Confident Approximate Policy Iteration for Efficient Local Planning in
$q^\pi$-realizable MDPs [2.5652904661855076]
We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes.
Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI).
CAPI computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $\epsilon$ of the action-value functions of stationary policies.
arXiv Detail & Related papers (2022-10-27T20:19:31Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $\texttt{s}$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
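A schematic version of such a primal-dual loop (a sketch of the generic framework, not the paper's exact algorithm; all callables below are hypothetical placeholders):

```python
def primal_dual_cmdp(policy_step, eval_cost, budget,
                     lam0=0.0, lr_dual=0.1, iters=100):
    """Generic primal-dual loop for a CMDP with constraint
    E[cost] <= budget. `policy_step(lam)` returns a policy improved
    against the Lagrangian reward r - lam * c; `eval_cost` returns that
    policy's expected cumulative cost."""
    lam, policies = lam0, []
    for _ in range(iters):
        pi = policy_step(lam)                 # primal: respond to current lam
        violation = eval_cost(pi) - budget    # > 0 means constraint violated
        lam = max(0.0, lam + lr_dual * violation)   # dual: projected ascent
        policies.append(pi)
    return policies, lam   # analyses typically bound the mixture of iterates
```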
arXiv Detail & Related papers (2022-04-11T15:08:09Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
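For reference, the update being analyzed (here in its exact, single-state bandit form; a sketch, with the slow-convergence phenomenon arising in full MDPs):

```python
import numpy as np
from scipy.special import softmax

def softmax_pg_bandit(r, eta=0.1, steps=1000):
    """Exact softmax policy-gradient ascent on a bandit with reward
    vector r. The gradient of E_pi[r] w.r.t. the logits theta is
    pi * (r - pi @ r); it flattens as pi concentrates on a suboptimal
    arm, which is the mechanism behind slow convergence."""
    theta = np.zeros_like(r, dtype=float)
    for _ in range(steps):
        pi = softmax(theta)
        theta += eta * pi * (r - pi @ r)   # exact policy-gradient step
    return softmax(theta)
```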
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
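A sketch of the standard least-squares (LSTD-style) solve for this setting, assuming features `phi` of logged state-action pairs and `phi_next_pi` of the next states paired with the target policy's actions; this illustrates linear OPE generically, not the paper's minimax-optimal estimator:

```python
import numpy as np

def linear_ope(phi, phi_next_pi, r, gamma=0.99, reg=1e-6):
    """Fit Q(s, a) ~= phi(s, a) @ w from logged data by solving the
    LSTD normal equations; phi, phi_next_pi: (N, d), r: (N,)."""
    d = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next_pi) + reg * np.eye(d)
    b = phi.T @ r
    w = np.linalg.solve(A, b)
    return w   # target policy's value at s0: phi(s0, pi(s0)) @ w
```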
arXiv Detail & Related papers (2020-02-21T19:20:57Z)