Optimal Sample Complexity for Average Reward Markov Decision Processes
- URL: http://arxiv.org/abs/2310.08833v2
- Date: Mon, 12 Feb 2024 19:06:53 GMT
- Title: Optimal Sample Complexity for Average Reward Markov Decision Processes
- Authors: Shengbo Wang, Jose Blanchet, and Peter Glynn
- Abstract summary: We develop an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}} \epsilon^{-2})$.
This marks the first algorithm and analysis to reach the literature's lower bound.
- Score: 1.0445957451908694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We resolve the open question regarding the sample complexity of policy
learning for maximizing the long-run average reward associated with a uniformly
ergodic Markov decision process (MDP), assuming a generative model. In this
context, the existing literature provides a sample complexity upper bound of
$\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of
$\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and
$|A|$ denote the cardinalities of the state and action spaces respectively,
$t_{\text{mix}}$ serves as a uniform upper limit for the total variation mixing
times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap
of $t_{\text{mix}}$ still remains to be bridged. Our primary contribution is
the development of an estimator for the optimal policy of average reward MDPs
with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$.
This marks the first algorithm and analysis to reach the literature's lower
bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin
and Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical
experiments to validate our theoretical findings.
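
For quick reference, the three rates discussed in the abstract line up as follows (same notation: $|S|$, $|A|$, $t_{\text{mix}}$, $\epsilon$); this is only a restatement of the bounds quoted above, not a new result:

```latex
% Sample complexity of policy learning for uniformly ergodic average-reward
% MDPs under a generative model, as stated in the abstract.
\begin{align*}
  \text{prior upper bound:} \quad & \widetilde O\!\left(|S||A|\, t_{\text{mix}}^{2}\, \epsilon^{-2}\right) \\
  \text{lower bound:}       \quad & \Omega\!\left(|S||A|\, t_{\text{mix}}\, \epsilon^{-2}\right) \\
  \text{this paper:}        \quad & \widetilde O\!\left(|S||A|\, t_{\text{mix}}\, \epsilon^{-2}\right)
\end{align*}
```

To make the generative-model setting concrete, here is a minimal, illustrative Python sketch of a generic model-based (plug-in) baseline: draw a fixed number of transitions for every state-action pair, build an empirical kernel, and solve the empirical MDP with a discount factor close to one as a common stand-in for the average-reward criterion. The oracle `sample_next_state` and all other identifiers are hypothetical, and this is not the estimator of the paper above; it only illustrates the $|S||A| \times$ (per-pair budget) accounting behind bounds such as $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$.

```python
import numpy as np

def plug_in_policy(sample_next_state, reward, n_states, n_actions,
                   samples_per_pair, gamma=0.99, sweeps=2000):
    """Generic model-based (plug-in) baseline under a generative model.

    `sample_next_state(s, a)` is a hypothetical oracle drawing one next state
    from P(. | s, a); `reward` is a known (n_states, n_actions) array.
    NOTE: this is NOT the estimator analyzed in the paper above; it only makes
    the sampling budget concrete: n_states * n_actions * samples_per_pair
    oracle calls in total.
    """
    # Estimate an empirical transition kernel from the generative oracle.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(samples_per_pair):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    P_hat /= samples_per_pair

    # Solve the empirical MDP with a discount factor close to 1, a common
    # proxy/reduction for the average-reward criterion in this literature.
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        V = Q.max(axis=1)                 # greedy value of the current Q
        Q = reward + gamma * (P_hat @ V)  # one value-iteration sweep
    return Q.argmax(axis=1)               # greedy policy for the empirical MDP
```

As a usage sketch, for a toy chain with a known kernel `P` one could pass `sample_next_state = lambda s, a: rng.choice(n_states, p=P[s, a])` and check that the returned policy stabilizes as `samples_per_pair` grows.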
Related papers
- Finding good policies in average-reward Markov Decision Processes without prior knowledge [19.89784209009327]
We revisit the identification of an $\varepsilon$-optimal policy in an average-reward Markov Decision Process (MDP).
By relying on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification.
arXiv Detail & Related papers (2024-05-27T12:24:14Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
For weakly communicating MDPs, we establish the complexity bound $\widetilde O\!\left(SA\frac{H}{\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2024-03-18T04:52:11Z) - Stochastic Approximation Approaches to Group Distributionally Robust
Optimization [96.26317627118912]
We study group distributionally robust optimization (GDRO).
We use online learning techniques to reduce the number of samples required in each round from $m$ to $1$, keeping the same sample complexity.
We also propose a novel formulation of weighted GDRO, which allows us to derive distribution-dependent convergence rates.
arXiv Detail & Related papers (2023-02-18T09:24:15Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde O(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs)
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Statistically Near-Optimal Hypothesis Selection [33.83129262033921]
We derive an optimal $2$-approximation learning strategy for the Hypothesis Selection problem.
This is the first algorithm that simultaneously achieves the best approximation factor and sample complexity.
arXiv Detail & Related papers (2021-08-17T21:11:20Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For sufficiently small $\epsilon$, we give an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP)
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\text{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)