Confident Approximate Policy Iteration for Efficient Local Planning in
$q^\pi$-realizable MDPs
- URL: http://arxiv.org/abs/2210.15755v1
- Date: Thu, 27 Oct 2022 20:19:31 GMT
- Title: Confident Approximate Policy Iteration for Efficient Local Planning in
$q^\pi$-realizable MDPs
- Authors: Gellért Weisz, András György, Tadashi Kozuno, and Csaba Szepesvári
- Abstract summary: We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes.
Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI).
CAPI computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $\epsilon$ of the action-value functions of stationary policies.
- Score: 2.5652904661855076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider approximate dynamic programming in $\gamma$-discounted Markov
decision processes and apply it to approximate planning with linear
value-function approximation. Our first contribution is a new variant of
Approximate Policy Iteration (API), called Confident Approximate Policy
Iteration (CAPI), which computes a deterministic stationary policy with an
optimal error bound scaling linearly with the product of the effective horizon
$H$ and the worst-case approximation error $\epsilon$ of the action-value
functions of stationary policies. This improvement over API (whose error scales
with $H^2$) comes at the price of an $H$-fold increase in memory cost. Unlike
Scherrer and Lesner [2012], who recommended computing a non-stationary policy
to achieve a similar improvement (with the same memory overhead), we are able
to stick to stationary policies. This allows for our second contribution, the
application of CAPI to planning with local access to a simulator and
$d$-dimensional linear function approximation. As such, we design a planning
algorithm that applies CAPI to obtain a sequence of policies with successively
refined accuracies on a dynamically evolving set of states. The algorithm
outputs an $\tilde O(\sqrt{d}H\epsilon)$-optimal policy after issuing $\tilde
O(dH^4/\epsilon^2)$ queries to the simulator, simultaneously achieving the
optimal accuracy bound and the best known query complexity bound, while earlier
algorithms in the literature achieve only one of them. This query complexity is
shown to be tight in all parameters except $H$. These improvements come at the
expense of a mild (polynomial) increase in memory and computational costs of
both the algorithm and its output policy.
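To make the planning loop concrete, here is a minimal, hypothetical sketch of approximate policy iteration with local simulator access and $d$-dimensional linear action-value features, in the spirit of (but not identical to) CAPI. The simulator interface (`sim.actions`, `sim.step`), the rollout-based least-squares estimator, the hyperparameter choices, and the rule for growing the set of visited states are all illustrative assumptions; CAPI's confidence-based policy updates are only loosely imitated here by restricting improvement to states with local data.

```python
# Hypothetical sketch, NOT the paper's CAPI algorithm: a plain approximate
# policy iteration loop with local simulator access and linear Q features.
# States are assumed hashable; phi(s, a) returns a length-d numpy vector.
import numpy as np


def fit_q_weights(sim, policy, states, phi, d, gamma, n_rollouts, horizon, lam=1e-3):
    """Ridge-regression fit of q^policy(s, a) ~ phi(s, a) @ w, using truncated
    Monte Carlo rollout returns from locally accessible states as targets."""
    A, b = lam * np.eye(d), np.zeros(d)
    for s in states:
        for a in sim.actions(s):
            x = phi(s, a)
            for _ in range(n_rollouts):
                s_next, r = sim.step(s, a)        # local access: step from a known state
                ret, disc = r, 1.0
                for _ in range(horizon):          # truncated discounted return
                    disc *= gamma
                    s_next, r = sim.step(s_next, policy(s_next))
                    ret += disc * r
                A += np.outer(x, x)
                b += ret * x
    return np.linalg.solve(A, b)


def local_api_planner(sim, s0, phi, d, gamma, eps, n_iters, n_rollouts=100):
    """Alternate between (i) estimating q^pi on the states visited so far and
    (ii) acting greedily w.r.t. the estimate on those states only."""
    H = 1.0 / (1.0 - gamma)                       # effective horizon
    horizon = int(np.ceil(H * np.log(H / eps)))   # rollout truncation length
    visited, greedy_action = {s0}, {}
    policy = lambda s: greedy_action.get(s, sim.actions(s)[0])  # default: first action
    for _ in range(n_iters):
        w = fit_q_weights(sim, policy, visited, phi, d, gamma, n_rollouts, horizon)
        for s in list(visited):                   # improve only where we have data
            acts = sim.actions(s)
            greedy_action[s] = acts[int(np.argmax([phi(s, a) @ w for a in acts]))]
            visited.add(sim.step(s, greedy_action[s])[0])  # grow the local state set
    return policy
```

The actual algorithm in the paper refines the accuracy of successive policies and manages the dynamically evolving state set far more carefully, which is what yields the $\tilde O(\sqrt{d}H\epsilon)$ accuracy and $\tilde O(dH^4/\epsilon^2)$ query bounds; the sketch above only mirrors the overall control flow.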
Related papers
- Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization [11.11876897168701]
We propose an algorithm that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$.
The proposed algorithm and analysis completely avoid occupancy measures, the tool typically used in this setting.
arXiv Detail & Related papers (2024-07-08T08:06:45Z) - Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Optimistic Policy Optimization is Provably Efficient in Non-stationary
MDPs [45.6318149525364]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs).
We propose the $\underline{\text{p}}$eriodically $\underline{\text{r}}$estarted $\underline{\text{o}}$ptimistic $\underline{\text{p}}$olicy $\underline{\text{o}}$ptimization algorithm (PROPO).
arXiv Detail & Related papers (2021-10-18T02:33:20Z) - Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms in which agents follow a (soft-)greedy policy with respect to it.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and a value function.
We derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value.
arXiv Detail & Related papers (2021-08-16T12:20:47Z) - Improved Regret Bound and Experience Replay in Regularized Policy
Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions.
Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
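For context on the last entry, a primal-dual scheme for a CMDP can be summarized by the standard Lagrangian relaxation below (a generic textbook form with a cost budget $b$, not necessarily the exact constraint or algorithm of that paper): the primal step optimizes the policy against the Lagrangian, and the dual step performs projected gradient ascent on the multiplier.

$$\max_{\pi}\;\min_{\lambda \ge 0}\; L(\pi,\lambda) = V_r^{\pi}(s_0) - \lambda\big(V_c^{\pi}(s_0) - b\big), \qquad \lambda \leftarrow \big[\lambda + \eta\big(V_c^{\pi}(s_0) - b\big)\big]_{+},$$

where $V_r^{\pi}$ and $V_c^{\pi}$ are the reward and cost values of $\pi$, and $\eta$ is a dual step size.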
This list is automatically generated from the titles and abstracts of the papers on this site.