An Exponential Lower Bound for Linearly-Realizable MDPs with Constant
Suboptimality Gap
- URL: http://arxiv.org/abs/2103.12690v1
- Date: Tue, 23 Mar 2021 17:05:54 GMT
- Title: An Exponential Lower Bound for Linearly-Realizable MDPs with Constant
Suboptimality Gap
- Authors: Yuanhao Wang, Ruosong Wang, Sham M. Kakade
- Abstract summary: We show that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed.
Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting.
- Score: 66.75488143823337
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A fundamental question in the theory of reinforcement learning is: suppose
the optimal $Q$-function lies in the linear span of a given $d$ dimensional
feature mapping, is sample-efficient reinforcement learning (RL) possible? The
recent and remarkable result of Weisz et al. (2020) resolved this question in
the negative, providing an exponential (in $d$) sample size lower bound, which
holds even if the agent has access to a generative model of the environment.
One may hope that this information theoretic barrier for RL can be circumvented
by further supposing an even more favorable assumption: there exists a
\emph{constant suboptimality gap} between the optimal $Q$-value of the best
action and that of the second-best action (for all states). The hope is that
having a large suboptimality gap would permit easier identification of optimal
actions themselves, thus making the problem tractable; indeed, provided the
agent has access to a generative model, sample-efficient RL is in fact possible
with the addition of this more favorable assumption.
This work focuses on this question in the standard online reinforcement
learning setting, where our main result resolves this question in the negative:
our hardness result shows that an exponential sample complexity lower bound
still holds even if a constant suboptimality gap is assumed in addition to
having a linearly realizable optimal $Q$-function. Perhaps surprisingly, this
implies an exponential separation between the online RL setting and the
generative model setting. Complementing our negative hardness result, we give
two positive results showing that provably sample-efficient RL is possible
either under an additional low-variance assumption or under a novel
hypercontractivity assumption (both implicitly place stronger conditions on the
underlying dynamics model).
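For concreteness, the two assumptions at the center of the hardness result can be stated as follows (the notation here is ours and may differ superficially from that of the paper):
- Linear realizability: there exists $\theta^* \in \mathbb{R}^d$ such that $Q^*(s,a) = \langle \theta^*, \phi(s,a) \rangle$ for all state-action pairs $(s,a)$, where $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is the given feature mapping.
- Constant suboptimality gap: there is a constant $\Delta_{\min} > 0$ such that, for every state $s$ and every action $a$ that is not optimal at $s$, $Q^*(s, \pi^*(s)) - Q^*(s,a) \ge \Delta_{\min}$.
The hypercontractivity condition appearing in the positive results is, in its standard $(4,2)$ form, the requirement that the relevant feature distribution satisfies $\mathbb{E}[\langle v, \phi \rangle^4] \le C \, (\mathbb{E}[\langle v, \phi \rangle^2])^2$ for all directions $v$ and some constant $C$; the paper's precise condition may differ in its details.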
Related papers
- Achieving the Asymptotically Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach [36.88301225561535]
Offline reinforcement learning aims to learn from pre-collected datasets without active exploration.
Existing approaches adopt a pessimistic stance towards uncertainty by penalizing rewards of under-explored state-action pairs to estimate value functions conservatively.
We show that a distributionally robust optimization (DRO)-based approach can also address these challenges and is asymptotically minimax optimal.
arXiv Detail & Related papers (2023-05-22T17:50:18Z) - KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal [70.15267479220691]
We consider and analyze the sample complexity of model-free reinforcement learning with a generative model.
Our analysis shows that it is nearly minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$ is sufficiently small.
arXiv Detail & Related papers (2022-05-27T19:39:24Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Sample-Efficient Reinforcement Learning Is Feasible for Linearly
Realizable MDPs with Limited Revisiting [60.98700344526674]
Low-complexity models such as linear function representations play a pivotal role in enabling sample-efficient reinforcement learning.
In this paper, we investigate a new sampling protocol, which draws samples in an online/exploratory fashion but allows one to backtrack and revisit previous states in a controlled and infrequent manner.
We develop an algorithm tailored to this setting, achieving a sample complexity that scales polynomially with the feature dimension, the horizon, and the inverse sub-optimality gap, but not with the size of the state/action space.
arXiv Detail & Related papers (2021-05-17T17:22:07Z) - Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement
Learning? [30.065091907118827]
This work considers the sample complexity of finding an $\epsilon$-optimal policy in a Markov decision process (MDP).
We solve this problem via a plug-in solver approach, which builds an empirical model and plans in this empirical model via an arbitrary plug-in solver.
We show that a plug-in approach can be sample efficient as well, providing a flexible approach to design model-based algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-10-12T13:13:01Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Towards Tractable Optimism in Model-Based Reinforcement Learning [37.51073590932658]
To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism), but not by so much that it is inaccurate (estimation error); a generic illustration of this trade-off is sketched after this list.
We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise-augmented MDP.
We show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
arXiv Detail & Related papers (2020-06-21T20:53:19Z)
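As a generic illustration of the optimism-versus-estimation-error trade-off referenced in the last entry above (this is a standard bonus-based construction, not the specific algorithm of that paper): an optimistic value iteration update adds an exploration bonus to the empirical Bellman backup, $\tilde Q_h(s,a) = \hat r(s,a) + \mathbb{E}_{s' \sim \hat P(\cdot \mid s,a)}[\tilde V_{h+1}(s')] + b(s,a)$ with $b(s,a) \propto 1/\sqrt{N(s,a)}$, so that $\tilde Q_h \ge Q^*_h$ holds with high probability (optimism) while the over-estimation shrinks as the visit count $N(s,a)$ grows (bounded estimation error).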
This list is automatically generated from the titles and abstracts of the papers on this site.