On Query-efficient Planning in MDPs under Linear Realizability of the
Optimal State-value Function
- URL: http://arxiv.org/abs/2102.02049v2
- Date: Thu, 4 Feb 2021 13:53:07 GMT
- Title: On Query-efficient Planning in MDPs under Linear Realizability of the
Optimal State-value Function
- Authors: Gellert Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, Csaba Szepesvári
- Abstract summary: We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model.
A recent lower bound established that the related problem, in which the action-value function of the optimal policy is assumed to be linearly realizable, requires an exponential number of queries.
In this work, we establish that poly$(H, d)$ learning is possible (with state value function realizability) whenever the action set is small.
- Score: 14.205660708980988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of local planning in fixed-horizon Markov Decision
Processes (MDPs) with a generative model under the assumption that the optimal
value function lies in the span of a feature map that is accessible through the
generative model. As opposed to previous work where linear realizability of all
policies was assumed, we consider the significantly relaxed assumption of a
single linearly realizable (deterministic) policy. A recent lower bound
established that the related problem when the action-value function of the
optimal policy is linearly realizable requires an exponential number of
queries, either in H (the horizon of the MDP) or d (the dimension of the
feature mapping). Their construction crucially relies on having an
exponentially large action set. In contrast, in this work, we establish that
poly$(H, d)$ learning is possible (with state value function realizability)
whenever the action set is small (i.e. O(1)). In particular, we present the
TensorPlan algorithm which uses poly$((dH/\delta)^A)$ queries to find a
$\delta$-optimal policy relative to any deterministic policy for which the
value function is linearly realizable with a parameter from a fixed radius ball
around zero. This is the first algorithm to give a polynomial query complexity
guarantee using only linear-realizability of a single competing value function.
Whether the computation cost is similarly bounded remains an interesting open
question. The upper bound is complemented by a lower bound which proves that in
the infinite-horizon episodic setting, planners that achieve constant
suboptimality need exponentially many queries, either in the dimension or the
number of actions.
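The following is a minimal sketch, not the TensorPlan algorithm itself, of the access model described in the abstract: a generative model that can be queried at (state, action) pairs, a feature map $\phi$, and the realizability assumption that the optimal state-value function is approximately $\langle \phi(s), \theta^* \rangle$ for some parameter $\theta^*$ in a fixed-radius ball. All names (GenerativeModel, lookahead_action, n_samples, etc.) are illustrative assumptions, not from the paper.

```python
import numpy as np

class GenerativeModel:
    """Simulator access: sample (reward, next_state) for a (state, action) pair."""

    def __init__(self, transition_sampler, feature_map, num_actions):
        self.sample = transition_sampler  # (state, action) -> (reward, next_state)
        self.phi = feature_map            # state -> np.ndarray of shape (d,)
        self.num_actions = num_actions    # small action set, |A| = O(1)

def lookahead_action(model, state, theta, n_samples=32):
    """Greedy one-step lookahead over the small action set.

    Maximizes an empirical estimate of E[reward + <phi(next_state), theta>].
    If theta is close to theta* (so <phi(s'), theta> approximates v*(s')),
    this is a greedy step with respect to the optimal value function; each
    inner-loop iteration costs one generative-model query.
    """
    best_action, best_value = None, -np.inf
    for action in range(model.num_actions):
        total = 0.0
        for _ in range(n_samples):
            reward, next_state = model.sample(state, action)
            total += reward + model.phi(next_state) @ theta
        value = total / n_samples
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```

The sketch only illustrates the query model and how a candidate parameter vector induces a local policy via one-step lookahead; it omits how TensorPlan searches for a parameter consistent with the observed transitions, which is the part the stated poly$((dH/\delta)^A)$ query bound accounts for.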
Related papers
- Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z)
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3 K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI)
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed-confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions [0.0]
We consider the minimax query complexity of online planning with a generative model in fixed-horizon decision processes.
We show that an exponentially large lower bound holds when $A = \Omega(\min(d^{1/4}, H^{1/2}))$.
arXiv Detail & Related papers (2021-10-05T17:42:46Z)
- Efficient Local Planning with Linear Function Approximation [27.90696655434707]
We study query and computationally efficient planning algorithms with linear function approximation and a simulator.
We propose an algorithm named confident Monte Carlo least square policy iteration (Confident MC-LSPI) for this setting.
One technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy algorithm.
arXiv Detail & Related papers (2021-08-12T04:56:33Z)
- Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.