Settling the Sample Complexity of Model-Based Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2204.05275v4
- Date: Fri, 8 Mar 2024 18:40:01 GMT
- Title: Settling the Sample Complexity of Model-Based Offline Reinforcement
Learning
- Authors: Gen Li and Laixi Shi and Yuxin Chen and Yuejie Chi and Yuting Wei
- Abstract summary: Offline reinforcement learning (RL) learns from pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
- Score: 50.5790774201146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper is concerned with offline reinforcement learning (RL), which
learns using pre-collected data without further exploration. Effective offline
RL would be able to accommodate distribution shift and limited data coverage.
However, prior algorithms or analyses either suffer from suboptimal sample
complexities or incur high burn-in cost to reach sample optimality, thus posing
an impediment to efficient offline RL in sample-starved applications.
We demonstrate that the model-based (or "plug-in") approach achieves
minimax-optimal sample complexity without burn-in cost for tabular Markov
decision processes (MDPs). Concretely, consider a finite-horizon (resp.
$\gamma$-discounted infinite-horizon) MDP with $S$ states and horizon $H$
(resp. effective horizon $\frac{1}{1-\gamma}$), and suppose the distribution
shift of data is reflected by some single-policy clipped concentrability
coefficient $C^{\star}_{\text{clipped}}$. We prove that model-based offline RL
yields $\varepsilon$-accuracy with a sample complexity of
\[
\begin{cases}
\frac{H^{4}SC^{\star}_{\text{clipped}}}{\varepsilon^{2}} & (\text{finite-horizon MDPs}) \\[0.5em]
\frac{SC^{\star}_{\text{clipped}}}{(1-\gamma)^{3}\varepsilon^{2}} & (\text{infinite-horizon MDPs})
\end{cases}
\]
up to logarithmic factors, which is
minimax optimal for the entire $\varepsilon$-range. The proposed algorithms are
"pessimistic" variants of value iteration with Bernstein-style penalties, and
do not require sophisticated variance reduction. Our analysis framework is
established upon delicate leave-one-out decoupling arguments in conjunction
with careful self-bounding techniques tailored to MDPs.
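As a rough illustration of the "pessimistic value iteration with Bernstein-style penalties" named in the abstract, here is a minimal NumPy sketch of a value-iteration-with-lower-confidence-bounds routine for the finite-horizon tabular setting. The input format, the penalty constants `c1` and `c2`, and the failure probability `delta` are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, H, c1=1.0, c2=1.0, delta=1e-3):
    """Sketch of value iteration with a Bernstein-style lower-confidence penalty.

    counts[h, s, a, s'] : number of offline transitions observed at step h
                          from (s, a) to s'.
    rewards[h, s, a]    : known rewards in [0, 1].
    c1, c2, delta       : hypothetical penalty constants / failure probability.
    Returns a deterministic policy pi[h, s] -> a.
    """
    Hh, S, A, _ = counts.shape
    assert Hh == H
    log_term = np.log(H * S * A / delta)

    V = np.zeros((H + 1, S))           # pessimistic value estimates, V[H] = 0
    pi = np.zeros((H, S), dtype=int)

    for h in range(H - 1, -1, -1):     # backward induction over the horizon
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                n = counts[h, s, a].sum()
                if n == 0:
                    Q[s, a] = 0.0      # unvisited pairs keep the trivial lower bound
                    continue
                P_hat = counts[h, s, a] / n              # empirical transition kernel
                mean_V = P_hat @ V[h + 1]
                var_V = max(P_hat @ (V[h + 1] ** 2) - mean_V ** 2, 0.0)
                # Bernstein-style penalty: variance-aware term plus a
                # lower-order range term.
                bonus = c1 * np.sqrt(var_V * log_term / n) + c2 * H * log_term / n
                Q[s, a] = max(0.0, rewards[h, s, a] + mean_V - bonus)
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi
```

The penalty subtracts more from state-action pairs that are poorly covered by the offline data, so the returned policy avoids regions where the distribution shift makes the empirical model unreliable.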
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model [3.749193647980305]
This paper considers a Markov decision process (MDP) that admits a set of state-action features.
We show that a model-based approach (resp. $Q$-learning) provably learns an $\varepsilon$-optimal policy with high probability.
arXiv Detail & Related papers (2021-05-28T17:49:39Z)
- Near-Optimal Offline Reinforcement Learning via Double Variance Reduction [36.027428493021716]
Off-Policy Double Variance Reduction is a new variance-reduction-based algorithm for offline RL.
We show that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^{2}/(d_{m}\epsilon^{2}))$ episodes of offline data.
We also show that OPDVR achieves rate-optimal sample complexity under alternative settings.
arXiv Detail & Related papers (2021-02-02T20:47:35Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
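The last entry above concerns a plain model-based ("plug-in") planning routine under a generative model: sample transitions for every state-action pair, build the empirical MDP, and solve it with standard value iteration. Below is a minimal sketch under assumed interfaces; `sample_next_state` is a hypothetical simulator call, and the fixed iteration count stands in for a proper convergence criterion.

```python
import numpy as np

def plug_in_planning(sample_next_state, rewards, gamma, N, num_iters=1000):
    """Sketch of the plug-in approach with a generative model.

    sample_next_state(s, a) : hypothetical simulator returning one next-state index.
    rewards[s, a]           : known rewards in [0, 1].
    N                       : number of simulator calls per (s, a) pair.
    Builds the empirical MDP from the samples, then runs plain value iteration
    on it; no pessimism is needed in the generative-model setting.
    """
    S, A = rewards.shape
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, sample_next_state(s, a)] += 1.0 / N

    Q = np.zeros((S, A))
    for _ in range(num_iters):                    # value iteration on the empirical MDP
        Q = rewards + gamma * P_hat @ Q.max(axis=1)
    return Q.argmax(axis=1)                       # greedy policy in the empirical MDP
```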