Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning
- URL: http://arxiv.org/abs/2110.04645v1
- Date: Sat, 9 Oct 2021 21:13:48 GMT
- Title: Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning
- Authors: Gen Li, Laixi Shi, Yuxin Chen, Yuantao Gu, Yuejie Chi
- Abstract summary: A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
- Score: 52.76230802067506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving sample efficiency in online episodic reinforcement learning (RL)
requires optimally balancing exploration and exploitation. When it comes to a
finite-horizon episodic Markov decision process with $S$ states, $A$ actions
and horizon length $H$, substantial progress has been achieved towards
characterizing the minimax-optimal regret, which scales on the order of
$\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples.
While several competing solution paradigms have been proposed to minimize
regret, they are either memory-inefficient, or fall short of optimality unless
the sample size exceeds an enormous threshold (e.g., $S^6A^4
\,\mathrm{poly}(H)$ for existing model-free methods).
To overcome such a large sample size barrier to efficient RL, we design a
novel model-free algorithm, with space complexity $O(SAH)$, that achieves
near-optimal regret as soon as the sample size exceeds the order of
$SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred
to as the initial burn-in cost),
our method improves -- by at least a factor of $S^5A^3$ -- upon any prior
memory-efficient algorithm that is asymptotically regret-optimal. Leveraging
the recently introduced variance reduction strategy (also called
reference-advantage decomposition), the proposed algorithm employs an
early-settled reference update rule, with the aid of two Q-learning sequences
with upper and lower confidence bounds. The design principle of our
early-settled variance reduction method might be of independent interest to
other RL settings that involve intricate exploration-exploitation trade-offs.
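As a rough illustration of the ingredients named in the abstract (two Q-learning sequences maintained with upper and lower confidence bounds, plus a reference value that is settled early for variance reduction), the sketch below shows how such an update could be organized. It is not the authors' pseudocode; the bonus form, step size, constants, and the settling tolerance are all placeholder assumptions.

```python
# Minimal sketch (not the paper's exact algorithm) of tabular Q-learning that keeps
# optimistic (UCB) and pessimistic (LCB) estimates and "settles" a reference value
# early: once the two confidence bounds at (h, s) are close, V_ref is frozen and can
# be reused for variance reduction. Sizes and constants below are illustrative.
import numpy as np

S, A, H = 10, 4, 5                      # states, actions, horizon (illustrative sizes)
c_bonus, settle_tol = 1.0, 1.0          # bonus scale and settling tolerance (assumed)

Q_ucb = np.full((H, S, A), float(H))    # optimistic Q-estimates, initialized at H
Q_lcb = np.zeros((H, S, A))             # pessimistic Q-estimates, initialized at 0
V_ucb = np.zeros((H + 1, S)); V_ucb[:H] = H
V_lcb = np.zeros((H + 1, S))
V_ref = np.full((H + 1, S), float(H))   # reference values for variance reduction
settled = np.zeros((H, S), dtype=bool)  # True once V_ref[h, s] has been frozen
N = np.zeros((H, S, A), dtype=int)      # visit counts

def step_size(n: int) -> float:
    # The (H + 1) / (H + n) learning rate that is standard in this line of work.
    return (H + 1) / (H + n)

def update(h, s, a, reward, s_next):
    """Process one observed transition (s, a, reward, s_next) at step h."""
    N[h, s, a] += 1
    n = N[h, s, a]
    lr = step_size(n)
    bonus = c_bonus * np.sqrt(H ** 3 / n)          # crude Hoeffding-style bonus

    # A full reference-advantage update would decompose the target into
    # V_ref[s_next] plus the advantage (V - V_ref)[s_next]; omitted for brevity.
    Q_ucb[h, s, a] = (1 - lr) * Q_ucb[h, s, a] + lr * (reward + V_ucb[h + 1, s_next] + bonus)
    Q_lcb[h, s, a] = (1 - lr) * Q_lcb[h, s, a] + lr * (reward + V_lcb[h + 1, s_next] - bonus)

    V_ucb[h, s] = min(V_ucb[h, s], Q_ucb[h, s].max())
    V_lcb[h, s] = max(V_lcb[h, s], Q_lcb[h, s].max())

    # Early-settled reference update: freeze V_ref at (h, s) as soon as the
    # optimistic and pessimistic values agree to within the tolerance.
    if not settled[h, s] and V_ucb[h, s] - V_lcb[h, s] <= settle_tol:
        V_ref[h, s] = V_ucb[h, s]
        settled[h, s] = True
```

An episode loop that acts greedily with respect to `Q_ucb` and feeds each transition into `update` is omitted; the point is only to show where an early-settled reference check could sit relative to the UCB/LCB updates.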
Related papers
- Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free stage-based Q-learning algorithm can enjoy the same optimality in the $H$ dependence as model-based algorithms.
Our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z)
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain a $\tilde{\mathcal{O}}\big(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK}\big)$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the time horizon, respectively.
arXiv Detail & Related papers (2022-10-25T14:30:48Z)
- Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z)
- Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement Learning? [30.065091907118827]
This work considers the sample complexity of finding an $\epsilon$-optimal policy in a Markov decision process (MDP).
We solve this problem via a plug-in solver approach, which builds an empirical model and plans in this empirical model via an arbitrary plug-in solver.
We show that a plug-in approach can be sample efficient as well, providing a flexible way to design model-based algorithms for reinforcement learning (a minimal sketch of this generic plug-in recipe appears after this list).
arXiv Detail & Related papers (2020-10-12T13:13:01Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
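Several of the entries above (the plug-in solver paper and the generative-model paper) describe the same generic recipe: build an empirical MDP from samples, then plan in it with any solver. The sketch below illustrates that recipe under simple assumptions (tabular MDP, uniform sampling from a generative model, value iteration as the plug-in solver); it is not taken from any of the papers, and all function names and parameters are illustrative.

```python
# Generic "plug-in" / model-based recipe, assuming access to a generative model:
# (1) estimate the transition kernel and rewards from samples, (2) plan in the
# resulting empirical MDP with an off-the-shelf solver (here: value iteration).
import numpy as np

def estimate_model(sampler, S, A, n_per_pair):
    """Build an empirical MDP by drawing n_per_pair samples from each (s, a)."""
    P_hat = np.zeros((S, A, S))
    r_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(n_per_pair):
                s_next, r = sampler(s, a)       # one generative-model call (assumed)
                P_hat[s, a, s_next] += 1.0
                r_hat[s, a] += r
            P_hat[s, a] /= n_per_pair
            r_hat[s, a] /= n_per_pair
    return P_hat, r_hat

def value_iteration(P_hat, r_hat, gamma, tol=1e-6):
    """Plug-in solver: standard value iteration on the empirical MDP."""
    S, A = r_hat.shape
    V = np.zeros(S)
    while True:
        Q = r_hat + gamma * P_hat @ V           # Q[s, a] = r_hat + gamma * sum_s' P_hat * V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new      # greedy policy and its value estimate
        V = V_new

# Example usage with a toy random MDP standing in for the generative model:
rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))
r_true = rng.random((S, A))
sampler = lambda s, a: (rng.choice(S, p=P_true[s, a]), r_true[s, a])
policy, V_hat = value_iteration(*estimate_model(sampler, S, A, n_per_pair=200), gamma)
```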