Exponential Lower Bounds for Planning in MDPs With Linearly-Realizable
Optimal Action-Value Functions
- URL: http://arxiv.org/abs/2010.01374v2
- Date: Mon, 15 Feb 2021 22:44:00 GMT
- Title: Exponential Lower Bounds for Planning in MDPs With Linearly-Realizable
Optimal Action-Value Functions
- Authors: Gellért Weisz, Philip Amortila, Csaba Szepesvári
- Abstract summary: We consider the problem of local planning in fixed-horizon and discounted Markov Decision Processes.
We show that any sound planner must query at least $\min(\exp(\Omega(d)), \Omega(2^H))$ samples in the fixed-horizon setting and $\exp(\Omega(d))$ samples in the discounted setting.
- Score: 3.2980012891975754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of local planning in fixed-horizon and discounted
Markov Decision Processes (MDPs) with linear function approximation and a
generative model under the assumption that the optimal action-value function
lies in the span of a feature map that is available to the planner. Previous
work has left open the question of whether there exist sound planners that need
only poly(H,d) queries regardless of the MDP, where H is the horizon and d is
the dimensionality of the features. We answer this question in the negative: we
show that any sound planner must query at least $\min(\exp({\Omega}(d)),
{\Omega}(2^H))$ samples in the fixed-horizon setting and $\exp({\Omega}(d))$
samples in the discounted setting. We also show that for any ${\delta}>0$, the
least-squares value iteration algorithm with $O(H^5d^{H+1}/{\delta}^2)$ queries
can compute a ${\delta}$-optimal policy in the fixed-horizon setting. We
discuss implications and remaining open questions.
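As a rough illustration of the upper-bound side, the sketch below shows a generic least-squares value iteration loop with linear features and a generative model. It is a minimal sketch, not the paper's construction: the feature map phi(s, a), the generative-model interface sample(s, a), and the anchor-state list are hypothetical placeholders, and each stage is fitted by plain ridge regression on sampled backups from stage H down to 1.

    import numpy as np

    def lsvi(phi, sample, anchor_states, actions, H, d, n_samples=100, reg=1e-3):
        # Least-squares value iteration sketch (not the paper's exact procedure).
        # phi(s, a)     : feature vector in R^d, assumed available to the planner
        # sample(s, a)  : one (reward, next_state) draw from the generative model
        # anchor_states : hypothetical finite set of states anchoring each regression
        # Returns weights w[0..H-1] with Q_h(s, a) ~= phi(s, a) @ w[h].
        w = [np.zeros(d) for _ in range(H + 1)]   # w[H] = 0 encodes V_{H+1} = 0
        for h in range(H - 1, -1, -1):            # back up from the last stage
            X, y = [], []
            for s in anchor_states:
                for a in actions:
                    for _ in range(n_samples):
                        r, s_next = sample(s, a)
                        # regression target: r + max_b phi(s_next, b) @ w[h+1]
                        v_next = max(phi(s_next, b) @ w[h + 1] for b in actions)
                        X.append(phi(s, a))
                        y.append(r + v_next)
            X, y = np.asarray(X), np.asarray(y)
            # ridge-regularised least squares for the stage-h weight vector
            w[h] = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
        return w[:H]

    def greedy_action(phi, actions, w_h, s):
        # Act greedily with respect to the fitted stage-h action-value estimate.
        return max(actions, key=lambda a: phi(s, a) @ w_h)

The $d^{H+1}$ factor in the stated query bound reflects how regression errors can be amplified across the $H$ stages under bare realizability of the optimal action-value function; the sketch only conveys the backup-and-regress structure, not the sampling scheme that attains that bound.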
Related papers
- Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
We study model-based reinforcement learning with non-linear function approximation.
We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings.
arXiv Detail & Related papers (2024-06-19T15:29:14Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - TensorPlan and the Few Actions Lower Bound for Planning in MDPs under
Linear Realizability of Optimal Value Functions [0.0]
We consider the minimax query complexity of online planning with a generative model in fixed-horizon decision processes.
We show that an exponentially large lower bound holds when $A = \Omega(\min(d^{1/4}, H^{1/2}))$.
arXiv Detail & Related papers (2021-10-05T17:42:46Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}| H^{3/2} / N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - On Query-efficient Planning in MDPs under Linear Realizability of the
Optimal State-value Function [14.205660708980988]
We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model.
A recent lower bound established that the related problem when the action-value function of the optimal policy is linearly realizable requires an exponential number of queries.
In this work, we establish that poly$(H, d)$ learning is possible (with state value function realizability) whenever the action set is small.
arXiv Detail & Related papers (2021-02-03T13:23:15Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning, which interacts with the environment at most $O\left(\frac{S^2 A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with
Constraints [8.849815837266977]
Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems.
We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning.
arXiv Detail & Related papers (2020-09-23T19:30:46Z)