Towards Tight Bounds on the Sample Complexity of Average-reward MDPs
- URL: http://arxiv.org/abs/2106.07046v1
- Date: Sun, 13 Jun 2021 17:18:11 GMT
- Title: Towards Tight Bounds on the Sample Complexity of Average-reward MDPs
- Authors: Yujia Jin, Aaron Sidford
- Abstract summary: We find an $\epsilon$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model.
We provide an algorithm that solves the problem using $\widetilde{O}(t_\mathrm{mix} \epsilon^{-3})$ (oblivious) samples per state-action pair.
- Score: 39.01663172393174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We prove new upper and lower bounds for sample complexity of finding an
$\epsilon$-optimal policy of an infinite-horizon average-reward Markov decision
process (MDP) given access to a generative model. When the mixing time of the
probability transition matrix of all policies is at most $t_\mathrm{mix}$, we
provide an algorithm that solves the problem using
$\widetilde{O}(t_\mathrm{mix} \epsilon^{-3})$ (oblivious) samples per
state-action pair. Further, we provide a lower bound showing that a linear
dependence on $t_\mathrm{mix}$ is necessary in the worst case for any algorithm
which computes oblivious samples. We obtain our results by establishing
connections between infinite-horizon average-reward MDPs and discounted MDPs of
possible further utility.
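To picture the generative-model setting and the discounted-MDP connection concretely, here is a minimal sketch (not the paper's algorithm): draw a fixed, oblivious batch of samples per state-action pair, build an empirical model, and run value iteration on a discounted surrogate. The interface `sampler(s, a, k)`, the discount choice $\gamma = 1 - \epsilon/t_\mathrm{mix}$, and the per-pair budget are illustrative assumptions, with constants and logarithmic factors ignored.

```python
import numpy as np

def solve_via_discounted_surrogate(sampler, n_states, n_actions, rewards, t_mix, eps):
    """Illustrative sketch only (not the paper's algorithm).

    sampler(s, a, k) -> length-k int array of next states drawn i.i.d. from P(. | s, a)
    rewards          -> (n_states, n_actions) array with entries in [0, 1]
    """
    # Discounted surrogate of the average-reward problem (illustrative choice).
    gamma = 1.0 - eps / t_mix

    # Oblivious sampling: the per-pair budget is fixed in advance, independent of
    # any intermediate estimates; its shape follows t_mix / eps^3 up to logs.
    n_samples = int(np.ceil(t_mix / eps**3))

    # Empirical transition model built from the oblivious samples.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            counts = np.bincount(sampler(s, a, n_samples), minlength=n_states)
            P_hat[s, a] = counts / n_samples

    # Plain value iteration on the empirical discounted MDP.
    V = np.zeros(n_states)
    n_iters = max(1, int(np.ceil(np.log(1.0 / eps) / (1.0 - gamma))))
    for _ in range(n_iters):
        Q = rewards + gamma * (P_hat @ V)   # (S, A) Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                 # greedy policy for the surrogate
```

The per-pair budget above only mirrors the shape of the $\widetilde{O}(t_\mathrm{mix}\epsilon^{-3})$ bound; the paper's algorithm and analysis are more refined than this sketch.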
Related papers
- Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span [16.49229317664822]
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear mixture Markov decision processes (MDPs).
Our algorithm for linear mixture MDPs achieves a nearly minimax optimal regret upper bound of $\widetilde{\mathcal{O}}(d\sqrt{\mathrm{sp}(v^*)T})$ over $T$ time steps.
arXiv Detail & Related papers (2024-10-19T05:45:50Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Near-Optimal Sample Complexity Bounds for Constrained MDPs [25.509556551558834]
We provide minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP.
We show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
arXiv Detail & Related papers (2022-06-13T15:58:14Z)
- Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- Efficiently Solving MDPs with Stochastic Mirror Descent [38.30919646721354]
We present a unified framework for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model.
We achieve these results through a more general mirror descent framework for solving bilinear saddle-point problems with simplex and box domains (see the hedged sketch below).
arXiv Detail & Related papers (2020-08-28T17:58:40Z)
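Following the last entry, here is a generic, hedged sketch of stochastic mirror descent on a bilinear saddle-point problem $\min_{x \in \Delta} \max_{y \in [0,1]^m} y^\top A x$ with a simplex domain (entropic updates) and a box domain (projected Euclidean updates). The single-coordinate gradient estimators, step sizes, and iterate averaging are illustrative choices, not the estimators or parameters of the cited paper.

```python
import numpy as np

def bilinear_smd(A, n_iters=10_000, eta_x=0.01, eta_y=0.01, seed=0):
    """Sketch: approximate a saddle point of min_{x in simplex} max_{y in [0,1]^m} y^T A x."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.full(n, 1.0 / n)                  # simplex block
    y = np.full(m, 0.5)                      # box block
    x_sum, y_sum = np.zeros(n), np.zeros(m)

    for _ in range(n_iters):
        # Unbiased stochastic gradients from single sampled coordinates.
        j = rng.choice(n, p=x)               # E[A[:, j]] = A x             (gradient in y)
        g_y = A[:, j]
        i = rng.integers(m)                  # E[m * y[i] * A[i, :]] = A^T y (gradient in x)
        g_x = m * y[i] * A[i, :]

        # Entropic (multiplicative) step on the simplex; projected step on the box.
        x = x * np.exp(-eta_x * g_x)
        x /= x.sum()
        y = np.clip(y + eta_y * g_y, 0.0, 1.0)

        x_sum += x
        y_sum += y

    return x_sum / n_iters, y_sum / n_iters  # averaged iterates
```

Roughly speaking, in the MDP application the simplex block plays the role of a policy/occupancy variable and the box block a value variable; the sketch keeps only the generic primal-dual template.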