Achieving the Minimax Optimal Sample Complexity of Offline Reinforcement
Learning: A DRO-Based Approach
- URL: http://arxiv.org/abs/2305.13289v3
- Date: Sun, 3 Dec 2023 23:00:11 GMT
- Title: Achieving the Minimax Optimal Sample Complexity of Offline Reinforcement
Learning: A DRO-Based Approach
- Authors: Yue Wang, Jinjun Xiong, Shaofeng Zou
- Abstract summary: offline reinforcement learning aims to learn from pre-collected datasets without active exploration.
Existing approaches adopt a pessimistic stance towards uncertainty by penalizing rewards of under-explored state-action pairs.
We show that the distributionally robust optimization approach can also address these challenges and is minimax optimal.
- Score: 41.45280954901834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning aims to learn from pre-collected datasets
without active exploration. This problem faces significant challenges,
including limited data availability and distributional shifts. Existing
approaches adopt a pessimistic stance towards uncertainty by penalizing rewards
of under-explored state-action pairs to estimate value functions
conservatively. In this paper, we show that the distributionally robust
optimization (DRO) based approach can also address these challenges and is
minimax optimal. Specifically, we directly model the uncertainty in the
transition kernel and construct an uncertainty set of statistically plausible
transition kernels. We then find the policy that optimizes the worst-case
performance over this uncertainty set. We first design a metric-based
Hoeffding-style uncertainty set such that with high probability the true
transition kernel is in this set. We prove that to achieve a sub-optimality gap
of $\epsilon$, the sample complexity is
$\mathcal{O}(S^2C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4})$, where $\gamma$ is the
discount factor, $S$ is the number of states, and $C^{\pi^*}$ is the
single-policy clipped concentrability coefficient which quantifies the
distribution shift. To achieve the optimal sample complexity, we further
propose a less conservative Bernstein-style uncertainty set, which, however,
does not necessarily include the true transition kernel. We show that an
improved sample complexity of
$\mathcal{O}(SC^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3})$ can be obtained, which
matches the minimax lower bound for offline reinforcement learning, and
thus is minimax optimal.
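To make the approach concrete, here is a minimal sketch of the DRO pipeline described above: estimate empirical transition kernels from dataset counts, give each $(s,a)$ pair a Hoeffding-style radius that shrinks with its visit count, and run robust value iteration whose Bellman backup uses the worst-case kernel in a total-variation ball. The TV geometry, the constant in the radius, and all names are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def worst_case_expectation(p_hat, v, radius):
    """Minimize q @ v over {q on the simplex : ||q - p_hat||_1 <= radius}.

    Closed form for a TV ball: shift mass from the highest-value states
    onto the single lowest-value state.
    """
    q = p_hat.copy()
    budget = radius / 2.0                  # an L1 radius moves radius/2 mass
    lo = np.argmin(v)
    for s in np.argsort(v)[::-1]:          # donate from high-value states
        if s == lo:
            continue
        take = min(q[s], budget)
        q[s] -= take
        q[lo] += take
        budget -= take
        if budget <= 0:
            break
    return q @ v

def robust_value_iteration(counts, rewards, gamma, delta, iters=500):
    """counts: (S, A, S) transition counts from the offline dataset;
    rewards: (S, A) with entries in [0, 1]; delta: failure probability.
    The radius sqrt(2 log(SA/delta) / n(s,a)) is a Hoeffding-style
    stand-in, not the paper's exact constant.
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)
    p_hat = np.where(n[..., None] > 0,
                     counts / np.maximum(n[..., None], 1), 1.0 / S)
    radius = np.minimum(2.0, np.sqrt(2.0 * np.log(S * A / delta)
                                     / np.maximum(n, 1)))
    v = np.zeros(S)
    q = np.zeros((S, A))
    for _ in range(iters):
        for s in range(S):
            for a in range(A):
                q[s, a] = rewards[s, a] + gamma * worst_case_expectation(
                    p_hat[s, a], v, radius[s, a])
        v = q.max(axis=1)
    return q.argmax(axis=1), v             # greedy policy, robust values
```

The paper's less conservative Bernstein-style set would swap in a variance-aware radius (and, as noted above, need not contain the true kernel), but the fixed-point iteration itself is unchanged.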
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Improved Sample Complexity Bounds for Distributionally Robust Reinforcement Learning [3.222802562733787]
We consider the problem of learning a control policy that is robust against parameter mismatches between the training environment and the testing environment.
We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the uncertainty sets specified by four different divergences.
We show that our algorithm achieves $\tilde{\mathcal{O}}(|\mathcal{S}\mathcal{A}| H^{5})$ sample complexity, which is uniformly better than the existing results.
arXiv Detail & Related papers (2023-03-05T21:47:08Z)
- A Finite Sample Complexity Bound for Distributionally Robust Q-learning [16.88062487980405]
We consider a reinforcement learning setting in which the deployment environment is different from the training environment.
Applying a robust Markov decision process formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al.
This is the first sample complexity result for the model-free robust RL problem.
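For intuition, below is a minimal sketch of one robust Q-learning step with a KL uncertainty set, evaluating the dual form $\sup_{\beta>0}\{-\beta\log\mathbb{E}_p[e^{-V/\beta}]-\beta\rho\}$ on sampled next states. The plain plug-in average inside the exponential is biased, which is one reason the original framework employs a more careful multilevel Monte Carlo estimator; the bounds on $\beta$ and all names here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_worst_case(v_samples, rho):
    """Dual of min_{KL(q||p) <= rho} E_q[v], estimated from samples of v."""
    def neg_dual(beta):
        m = np.max(-v_samples / beta)                # log-sum-exp shift
        lse = m + np.log(np.mean(np.exp(-v_samples / beta - m)))
        return beta * lse + beta * rho               # minus the dual value
    res = minimize_scalar(neg_dual, bounds=(1e-3, 1e3), method="bounded")
    return -res.fun

def robust_q_update(Q, s, a, r, next_states, gamma, rho, lr):
    """One tabular robust Q-learning step from sampled next states."""
    v = Q.max(axis=1)                                # greedy value estimates
    target = r + gamma * kl_worst_case(v[next_states], rho)
    Q[s, a] += lr * (target - Q[s, a])
    return Q
```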
arXiv Detail & Related papers (2023-02-26T01:15:32Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as a starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling [64.31011847952006]
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions.
We generalize the PAGE algorithm so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of the optimal bound in the smooth nonconvex regime.
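If the method name lost inside the color macro above is indeed the PAGE estimator (an inference from the paper's setting, not something the garbled summary confirms), a minimal PAGE loop looks as follows; `full_grad` and `pair_grad` are hypothetical user-supplied oracles, and `pair_grad` must evaluate both points on the same minibatch for the correction to be variance-reducing.

```python
import numpy as np

def page(full_grad, pair_grad, x0, lr, p, n_iters, seed=0):
    """Minimal PAGE loop.

    full_grad(x): gradient on a large batch (or the full dataset).
    pair_grad(x_new, x_old): gradients at both points on the SAME
    small minibatch, returned as a (g_new, g_old) pair.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    g = full_grad(x)                       # initialize with a full gradient
    for _ in range(n_iters):
        x_new = x - lr * g
        if rng.random() < p:
            g = full_grad(x_new)           # occasional full refresh
        else:
            g_new, g_old = pair_grad(x_new, x)
            g = g + g_new - g_old          # variance-reduced correction
        x = x_new
    return x
```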
arXiv Detail & Related papers (2022-06-05T21:32:33Z)
- Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
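As a rough illustration of the model-based ("plug-in") approach, the sketch below builds the empirical MDP from counts and runs value iteration with a count-based pessimism penalty. The simple Hoeffding-style bonus $c/\sqrt{n(s,a)}$ stands in for the paper's sharper variance-aware penalties; all names and constants are illustrative.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, gamma, c, iters=500):
    """Plug-in MDP plus a lower-confidence-bound penalty on each (s, a).

    counts: (S, A, S) transition counts; rewards: (S, A) in [0, 1].
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)
    p_hat = np.where(n[..., None] > 0,
                     counts / np.maximum(n[..., None], 1), 1.0 / S)
    bonus = c / np.sqrt(np.maximum(n, 1))
    v = np.zeros(S)
    for _ in range(iters):
        q = rewards - bonus + gamma * p_hat @ v   # penalized backup
        q = np.clip(q, 0.0, 1.0 / (1.0 - gamma))  # keep values in range
        v = q.max(axis=1)
    return q.argmax(axis=1), v                    # greedy policy, values
```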
arXiv Detail & Related papers (2022-04-11T17:26:19Z)
- Non-convex Distributionally Robust Optimization: Non-asymptotic Analysis [16.499651513178012]
Distributionally robust optimization (DRO) is a widely-used approach to learn models that are robust against distribution shift.
We provide non-asymptotic convergence guarantees for normalized gradient descent even though the objective function is possibly non-convex and non-smooth.
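A minimal sketch of the kind of method analyzed in that line of work, mini-batch normalized SGD with momentum (the `stoch_grad` oracle and every hyperparameter are illustrative assumptions):

```python
import numpy as np

def normalized_sgd_momentum(stoch_grad, x0, lr, beta, n_iters):
    """Normalized SGD with momentum: step along the momentum buffer
    divided by its norm, which copes with the non-smoothness that
    DRO objectives typically exhibit."""
    x = x0.copy()
    m = np.zeros_like(x)
    for _ in range(n_iters):
        g = stoch_grad(x)                          # mini-batch gradient
        m = beta * m + (1.0 - beta) * g            # momentum averaging
        x = x - lr * m / (np.linalg.norm(m) + 1e-12)
    return x
```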
arXiv Detail & Related papers (2021-10-24T14:56:38Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)