Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP
- URL: http://arxiv.org/abs/2212.00603v1
- Date: Thu, 1 Dec 2022 15:57:58 GMT
- Title: Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP
- Authors: Jinghan Wang, Mengdi Wang, Lin F. Yang
- Abstract summary: This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde O(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := sp(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
- Score: 58.13930707612128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work considers the sample complexity of obtaining an
$\varepsilon$-optimal policy in an average reward Markov Decision Process
(AMDP), given access to a generative model (simulator). When the ground-truth
MDP is weakly communicating, we prove an upper bound of $\widetilde O(H
\varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H
:= sp(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the
accuracy, and $\delta$ is the failure probability. This bound improves on the
best-known mixing-time-based approaches in [Jin & Sidford 2021], which assume
that the mixing time of every deterministic policy is bounded. The core of our
analysis is a proper reduction bound from AMDP problems to discounted MDP
(DMDP) problems, which may be of independent interest since it allows the
application of DMDP algorithms to AMDPs in other settings. We complement our
upper bound by proving a minimax lower bound of $\Omega(|\mathcal S| |\mathcal
A| H \varepsilon^{-2} \ln \frac{1}{\delta})$ total samples, showing that a
linear dependence on $H$ is necessary and that our upper bound matches the lower
bound in all parameters of $(|\mathcal S|, |\mathcal A|, H, \ln
\frac{1}{\delta})$ up to some logarithmic factors.
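  A back-of-the-envelope reading of the upper bound (a heuristic sketch, not the paper's actual argument; constants, logarithmic factors, and the careful treatment of the bias spans of suboptimal policies are all omitted): for a weakly communicating MDP, a standard inequality gives $|V^*_\gamma(s) - \frac{\rho^*}{1-\gamma}| \le sp(h^*) = H$ for every state $s$. Choosing the discount factor so that the discounting error $(1-\gamma)H$ is of order $\varepsilon$, and solving the DMDP to unnormalized accuracy $\varepsilon_\gamma$ with $(1-\gamma)\varepsilon_\gamma$ also of order $\varepsilon$, corresponds to
  $1-\gamma \approx \varepsilon / H, \qquad \varepsilon_\gamma \approx \varepsilon/(1-\gamma) \approx H.$
  Plugging these choices into the known generative-model DMDP rate of $\widetilde O\big(\frac{1}{(1-\gamma)^3 \varepsilon_\gamma^2}\big)$ samples per state-action pair yields
  $\widetilde O\Big(\frac{(H/\varepsilon)^3}{H^2}\Big) = \widetilde O\big(H \varepsilon^{-3}\big),$
  matching the stated upper bound up to logarithmic factors.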
Related papers
- Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs [35.22742439337603]
The proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal.
For a parameterized policy class with transferred compatibility approximation error, PDR-ANPG achieves a last-iterate $\epsilon$-optimality gap.
This is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
arXiv Detail & Related papers (2024-08-21T10:44:57Z) - Robust Distribution Learning with Local and Global Adversarial Corruptions [17.22168727622332]
We develop an efficient finite-sample algorithm with error bounded by $\sqrt{\varepsilon k} + \rho + \tilde{O}(d\sqrt{k}\,n^{-1/(k \lor 2)})$ when $P$ has bounded covariance.
Our efficient procedure relies on a novel trace norm approximation of an ideal yet intractable 2-Wasserstein projection estimator.
arXiv Detail & Related papers (2024-06-10T17:48:36Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
For weakly communicating MDPs, we establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2024-03-18T04:52:11Z) - Span-Based Optimal Sample Complexity for Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space (see the sketch after this list for how this bound relates to the average-reward bound above).
arXiv Detail & Related papers (2023-11-22T15:34:44Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - SDP Achieves Exact Minimax Optimality in Phase Synchronization [19.909352968029584]
We study the phase synchronization problem with noisy measurements $Y = z^* z^{*\mathrm{H}} + \sigma W \in \mathbb{C}^{n \times n}$.
We prove that SDP achieves the error bound $(1+o(1))\frac{\sigma^2}{2np}$ under a squared $\ell_2$ loss.
arXiv Detail & Related papers (2021-01-07T03:14:05Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an improved algorithm with better sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
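  As a heuristic consistency check of the two span-based bounds quoted above (not the papers' argument; constants and the precise form of the reduction are omitted): obtaining an $\varepsilon$-optimal average-reward policy via the standard AMDP-to-DMDP reduction roughly requires solving the discounted problem to unnormalized accuracy $\varepsilon_\gamma \approx \varepsilon/(1-\gamma)$ while keeping $1-\gamma \lesssim \varepsilon/H$. Substituting $\varepsilon_\gamma$ into the discounted bound gives
  $\widetilde O\Big(SA\frac{H}{(1-\gamma)^2 \varepsilon_\gamma^2}\Big) = \widetilde O\Big(SA\frac{H}{(1-\gamma)^2}\cdot\frac{(1-\gamma)^2}{\varepsilon^2}\Big) = \widetilde O\Big(SA\frac{H}{\varepsilon^2}\Big),$
  which is independent of $\gamma$ and matches the weakly communicating average-reward bound quoted in the first of the two span-based entries.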