Efficiently Solving MDPs with Stochastic Mirror Descent
- URL: http://arxiv.org/abs/2008.12776v1
- Date: Fri, 28 Aug 2020 17:58:40 GMT
- Title: Efficiently Solving MDPs with Stochastic Mirror Descent
- Authors: Yujia Jin and Aaron Sidford
- Abstract summary: We present a unified framework for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model.
We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains.
- Score: 38.30919646721354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a unified framework based on primal-dual stochastic mirror descent
for approximately solving infinite-horizon Markov decision processes (MDPs)
given a generative model. When applied to an average-reward MDP with $A_{tot}$
total state-action pairs and mixing time bound $t_{mix}$ our method computes an
$\epsilon$-optimal policy with an expected $\widetilde{O}(t_{mix}^2 A_{tot}
\epsilon^{-2})$ samples from the state-transition matrix, removing the
ergodicity dependence of prior art. When applied to a $\gamma$-discounted MDP
with $A_{tot}$ total state-action pairs our method computes an
$\epsilon$-optimal policy with an expected $\widetilde{O}((1-\gamma)^{-4}
A_{tot} \epsilon^{-2})$ samples, matching the previous state-of-the-art up to a
$(1-\gamma)^{-1}$ factor. Both methods are model-free, update state values and
policies simultaneously, and run in time linear in the number of samples taken.
We achieve these results through a more general stochastic mirror descent
framework for solving bilinear saddle-point problems with simplex and box
domains and we demonstrate the flexibility of this framework by providing
further applications to constrained MDPs.
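Code sketch (not from the paper): as a rough illustration of the simplex/box saddle-point structure underlying this framework, the following Python sketch runs generic primal-dual stochastic mirror descent on a bilinear problem $\min_{x \in [-R,R]^n} \max_{y \in \Delta_m} y^\top(Ax - b) + c^\top x$, using Euclidean steps on the box side and entropic (multiplicative-weights) steps on the simplex side. The problem data, step sizes, and row/column sampling scheme are illustrative assumptions and do not reproduce the paper's specific gradient estimators or sample-complexity guarantees.

```python
import numpy as np

def stochastic_mirror_descent_saddle(A, b, c, num_iters=5000, box_radius=1.0,
                                     step_x=0.01, step_y=0.01, seed=None):
    """Hypothetical sketch of primal-dual stochastic mirror descent for the
    bilinear saddle-point problem
        min_{x in [-R, R]^n} max_{y in simplex_m}  y^T (A x - b) + c^T x.
    Box side: Euclidean (projected gradient) steps. Simplex side: entropic
    mirror descent (multiplicative weights). Stochasticity comes from sampling
    one row/column of A per iteration with importance weighting."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)                      # box variable (e.g., value-like iterate)
    y = np.ones(m) / m                   # simplex variable (e.g., a distribution)
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(num_iters):
        i = rng.integers(m)              # uniformly sampled row index
        j = rng.integers(n)              # uniformly sampled column index
        # Unbiased stochastic gradients of the bilinear objective:
        # E[m * y_i * A[i,:]] = A^T y  and  E[n * x_j * A[:,j]] = A x.
        g_x = m * y[i] * A[i, :] + c
        g_y = n * x[j] * A[:, j] - b
        # Box side: Euclidean descent step, projected onto [-R, R]^n.
        x = np.clip(x - step_x * g_x, -box_radius, box_radius)
        # Simplex side: entropic ascent step (multiplicative weights), renormalized.
        y = y * np.exp(step_y * g_y)
        y /= y.sum()
        x_avg += x / num_iters
        y_avg += y / num_iters
    return x_avg, y_avg                  # averaged iterates approximate the saddle point
```

In the MDP instantiation described in the abstract, the box variable would play the role of value estimates and the simplex variable a distribution over state-action pairs; the paper's carefully designed estimators and step sizes are what yield the stated $\widetilde{O}(\cdot)$ sample complexities.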
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
For weakly communicating MDPs, we establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2024-03-18T04:52:11Z) - Span-Based Optimal Sample Complexity for Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2023-11-22T15:34:44Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Towards Tight Bounds on the Sample Complexity of Average-reward MDPs [39.01663172393174]
We find an optimal policy of an infinite-horizon average-reward Markov decision process given access to a generative model.
We provide an algorithm that solves the problem using $\widetilde{O}(t_{\mathrm{mix}} \epsilon^{-3})$ (oblivious) samples per state-action pair.
arXiv Detail & Related papers (2021-06-13T17:18:11Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.