Settling the Horizon-Dependence of Sample Complexity in Reinforcement
Learning
- URL: http://arxiv.org/abs/2111.00633v1
- Date: Mon, 1 Nov 2021 00:21:24 GMT
- Title: Settling the Horizon-Dependence of Sample Complexity in Reinforcement
Learning
- Authors: Yuanzhi Li, Ruosong Wang, Lin F. Yang
- Abstract summary: We develop an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions.
We establish a connection between value functions in discounted and finite-horizon Markov decision processes.
- Score: 82.31436758872715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently there is a surge of interest in understanding the horizon-dependence
of the sample complexity in reinforcement learning (RL). Notably, for an RL
environment with horizon length $H$, previous work has shown that there is a
probably approximately correct (PAC) algorithm that learns an $O(1)$-optimal
policy using $\mathrm{polylog}(H)$ episodes of environment interactions when
the number of states and actions is fixed. It is yet unknown whether the
$\mathrm{polylog}(H)$ dependence is necessary or not. In this work, we resolve
this question by developing an algorithm that achieves the same PAC guarantee
while using only $O(1)$ episodes of environment interactions, completely
settling the horizon-dependence of the sample complexity in RL. We achieve this
bound by (i) establishing a connection between value functions in discounted
and finite-horizon Markov decision processes (MDPs) and (ii) a novel
perturbation analysis in MDPs. We believe our new techniques are of independent
interest and could be applied in related questions in RL.
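For context, a standard truncation bound (a textbook fact, not the paper's actual construction) already illustrates why discounted and finite-horizon value functions are closely related. Writing $V_{\gamma}^{\pi}(s) = \mathbb{E}^{\pi}[\sum_{t \ge 0} \gamma^{t} r(s_t, a_t) \mid s_0 = s]$ for the discounted value of a policy $\pi$ with rewards in $[0,1]$, truncating at horizon $H$ loses only a geometrically small tail:
\[
0 \;\le\; V_{\gamma}^{\pi}(s) - \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{H-1} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]
\;\le\; \sum_{t=H}^{\infty} \gamma^{t}
\;=\; \frac{\gamma^{H}}{1-\gamma}.
\]
Hence a horizon of $H = O\big(\frac{1}{1-\gamma}\log\frac{1}{\epsilon(1-\gamma)}\big)$ makes the truncated value an $\epsilon$-accurate proxy for the discounted one. The connection established in the paper is sharper than this crude bound, but it gives a sense of why discounted MDPs with effective horizon $\frac{1}{1-\gamma}$ act as a bridge to finite-horizon problems.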
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- Optimal Sample Complexity of Reinforcement Learning for Mixing Discounted Markov Decision Processes [1.0445957451908694]
We consider the optimal sample complexity theory for maximizing the infinite-horizon discounted reward in a Markov decision process (MDP).
In many applications of interest, the optimal policy (or all policies) induces mixing.
Our analysis is grounded in regeneration-type ideas, which we believe are of independent interest, as they can be used to study RL problems for general state space MDPs.
arXiv Detail & Related papers (2023-02-15T05:43:17Z)
- Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns from pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z)
- Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- Towards Tight Bounds on the Sample Complexity of Average-reward MDPs [39.01663172393174]
We consider the problem of finding an optimal policy of an infinite-horizon average-reward Markov decision process given access to a generative model.
We provide an algorithm that solves the problem using $\widetilde{O}(t_{\mathrm{mix}} \epsilon^{-3})$ (oblivious) samples per state-action pair.
arXiv Detail & Related papers (2021-06-13T17:18:11Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.