Optimal Sample Complexity of Reinforcement Learning for Mixing
Discounted Markov Decision Processes
- URL: http://arxiv.org/abs/2302.07477v3
- Date: Sat, 30 Sep 2023 13:31:31 GMT
- Title: Optimal Sample Complexity of Reinforcement Learning for Mixing
Discounted Markov Decision Processes
- Authors: Shengbo Wang, Jose Blanchet, and Peter Glynn
- Abstract summary: We consider the optimal sample complexity theory for maximizing the infinite horizon discounted reward in a Markov decision process (MDP).
In many applications of interest, the optimal policy (or all policies) induces mixing.
Our analysis is grounded in regeneration-type ideas, which we believe are of independent interest, as they can be used to study RL problems for general state space MDPs.
- Score: 1.0445957451908694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the optimal sample complexity theory of tabular reinforcement
learning (RL) for maximizing the infinite horizon discounted reward in a Markov
decision process (MDP). Optimal worst-case complexity results have been
developed for tabular RL problems in this setting, leading to a sample
complexity dependence on $\gamma$ and $\epsilon$ of the form $\tilde
\Theta((1-\gamma)^{-3}\epsilon^{-2})$, where $\gamma$ denotes the discount
factor and $\epsilon$ is the solution error tolerance. However, in many
applications of interest, the optimal policy (or all policies) induces mixing.
We establish that in such settings, the optimal sample complexity dependence is
$\tilde \Theta(t_{\text{mix}}(1-\gamma)^{-2}\epsilon^{-2})$, where
$t_{\text{mix}}$ is the total variation mixing time. Our analysis is grounded
in regeneration-type ideas, which we believe are of independent interest, as
they can be used to study RL problems for general state space MDPs.
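To make the two rates concrete, here is a minimal numerical sketch (not part of the paper) that evaluates the worst-case scaling $(1-\gamma)^{-3}\epsilon^{-2}$ against the mixing scaling $t_{\text{mix}}(1-\gamma)^{-2}\epsilon^{-2}$ for a few hypothetical choices of $\gamma$, $\epsilon$, and $t_{\text{mix}}$; constants and logarithmic factors are ignored.

```python
# Illustrative (hypothetical) comparison of the two sample-complexity scalings
# from the abstract, ignoring logarithmic factors and absolute constants.

def worst_case_scaling(gamma: float, eps: float) -> float:
    """Minimax rate without mixing assumptions: (1-gamma)^{-3} * eps^{-2}."""
    return (1.0 - gamma) ** -3 * eps ** -2

def mixing_scaling(gamma: float, eps: float, t_mix: float) -> float:
    """Rate under mixing: t_mix * (1-gamma)^{-2} * eps^{-2}."""
    return t_mix * (1.0 - gamma) ** -2 * eps ** -2

if __name__ == "__main__":
    eps = 0.1  # hypothetical solution error tolerance
    for gamma in (0.9, 0.99, 0.999):
        for t_mix in (10.0, 100.0):
            wc = worst_case_scaling(gamma, eps)
            mx = mixing_scaling(gamma, eps, t_mix)
            print(f"gamma={gamma}, t_mix={t_mix}: "
                  f"worst-case ~ {wc:.2e}, mixing ~ {mx:.2e}")
```

As the output suggests, the mixing-time bound is the sharper of the two precisely when $t_{\text{mix}} \ll (1-\gamma)^{-1}$, i.e. when the induced chain mixes well within one effective horizon.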
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Optimal Sample Complexity for Average Reward Markov Decision Processes [1.0445957451908694]
We develop an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde{O}(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$.
This marks the first algorithm and analysis to reach the literature's lower bound.
arXiv Detail & Related papers (2023-10-13T03:08:59Z) - A Finite Sample Complexity Bound for Distributionally Robust Q-learning [17.96094201655567]
We consider a reinforcement learning setting in which the deployment environment is different from the training environment.
Applying a robust Markov decision processes formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al.
This is the first sample complexity result for the model-free robust RL problem.
arXiv Detail & Related papers (2023-02-26T01:15:32Z) - Settling the Sample Complexity of Model-Based Offline Reinforcement
Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Settling the Horizon-Dependence of Sample Complexity in Reinforcement
Learning [82.31436758872715]
We develop an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions.
We establish a connection between value functions in discounted and finite-horizon Markov decision processes.
arXiv Detail & Related papers (2021-11-01T00:21:24Z) - Policy Mirror Descent for Reinforcement Learning: Linear Convergence,
New Sampling Complexity, and Generalized Problem Classes [6.240369435223]
We present new policy mirror descent (PMD) methods for solving reinforcement learning problems with strongly convex or general convex regularizers.
To the best of our knowledge, these developments appear to be new in both the optimization and reinforcement learning literature.
arXiv Detail & Related papers (2021-01-30T02:30:45Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\text{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor (a rough numerical reading of this bound is sketched below).
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
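The sketch below (not from the paper) evaluates the two terms of the asynchronous Q-learning bound separately for hypothetical values of the minimum state-action occupancy $\mu_{\min}$, the discount factor $\gamma$, the tolerance $\varepsilon$, and the mixing time $t_{\text{mix}}$; constants and logarithmic factors are again ignored.

```python
# Rough (hypothetical) numerical reading of the asynchronous Q-learning bound
#   1 / (mu_min * (1-gamma)^5 * eps^2)  +  t_mix / (mu_min * (1-gamma)),
# ignoring logarithmic factors and constants.

def async_q_learning_terms(mu_min: float, gamma: float, eps: float, t_mix: float):
    """Return the accuracy-driven term and the mixing (burn-in) term separately."""
    accuracy_term = 1.0 / (mu_min * (1.0 - gamma) ** 5 * eps ** 2)
    burn_in_term = t_mix / (mu_min * (1.0 - gamma))
    return accuracy_term, burn_in_term

if __name__ == "__main__":
    # Hypothetical parameters: occupancy floor, discount, tolerance, mixing time.
    acc, burn = async_q_learning_terms(mu_min=1e-3, gamma=0.99, eps=0.1, t_mix=100.0)
    print(f"accuracy term ~ {acc:.2e}, burn-in term ~ {burn:.2e}")
    # For small eps the first term dominates; the t_mix term acts as a burn-in cost.
```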