Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes
- URL: http://arxiv.org/abs/2306.16394v1
- Date: Wed, 28 Jun 2023 17:43:19 GMT
- Title: Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes
- Authors: Zihan Zhang and Qiaomin Xie
- Abstract summary: We develop provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\mathrm{sp}^2(h^*)}{\epsilon^2}+\frac{S^2A\mathrm{sp}(h^*)}{\epsilon}\right)$ samples.
- Score: 21.77276136591518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop several provably efficient model-free reinforcement learning (RL)
algorithms for infinite-horizon average-reward Markov Decision Processes
(MDPs). We consider both the online setting and the setting with access to a
simulator. In the online setting, we propose model-free RL algorithms based on
reference-advantage decomposition. Our algorithm achieves
$\widetilde{O}(S^5A^2\mathrm{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where
$S\times A$ is the size of the state-action space and
$\mathrm{sp}(h^*)$ is the span of the optimal bias function. Our results are the
first to achieve optimal dependence on $T$ for weakly communicating MDPs.
In the simulator setting, we propose a model-free RL algorithm that finds an
$\epsilon$-optimal policy using $\widetilde{O}
\left(\frac{SA\mathrm{sp}^2(h^*)}{\epsilon^2}+\frac{S^2A\mathrm{sp}(h^*)}{\epsilon}
\right)$ samples, whereas the minimax lower bound is
$\Omega\left(\frac{SA\mathrm{sp}(h^*)}{\epsilon^2}\right)$.
Our results are based on two new techniques that are unique to the
average-reward setting: 1) a tighter discounted approximation via
value-difference estimation; and 2) an efficient construction of a confidence
region for the optimal bias function with space complexity $O(SA)$.
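As a rough illustration of the discounted-approximation idea (not the authors' algorithm), the sketch below runs synchronous model-free Q-learning against a generative model, with the discount factor chosen so that the effective horizon $1/(1-\gamma)$ scales with $\mathrm{sp}(h^*)/\epsilon$. The function name, the `simulator(s, a)` interface, the step-size schedule, and the iteration budget are all assumptions made for this sketch.

```python
import numpy as np

def q_learning_with_simulator(simulator, S, A, span_h, eps, n_iters=10_000):
    """Illustrative synchronous Q-learning under a discounted approximation.

    `simulator(s, a)` is a hypothetical generative-model interface assumed to
    return a sampled (next_state, reward) pair; it is not part of the paper.
    The discount factor is set so the effective horizon 1/(1 - gamma) scales
    with sp(h*)/eps, the standard average-to-discounted reduction that the
    paper's value-difference estimation technique sharpens.
    """
    gamma = 1.0 - eps / max(span_h, eps)          # effective horizon ~ sp(h*)/eps
    Q = np.zeros((S, A))
    for t in range(1, n_iters + 1):
        alpha = 1.0 / (1.0 + (1.0 - gamma) * t)   # rescaled linear step size
        for s in range(S):
            for a in range(A):
                s_next, r = simulator(s, a)       # one generative-model sample
                target = r + gamma * Q[s_next].max()
                Q[s, a] += alpha * (target - Q[s, a])
    return Q, Q.argmax(axis=1)                    # Q estimates and greedy policy
```

In the online setting, the paper instead pairs this discounted view with reference-advantage decomposition and an $O(SA)$-space confidence region for the optimal bias function to control regret.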
Related papers
- A Proximal Modified Quasi-Newton Method for Nonsmooth Regularized Optimization [0.7373617024876725]
arXiv Detail & Related papers (2024-09-28T18:16:32Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise [50.64137465792738]
We show that any efficient SQ algorithm for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$.
Our lower bound suggests that this quadratic dependence on $1/\epsilon$ is inherent for efficient algorithms.
arXiv Detail & Related papers (2023-07-13T18:59:28Z)
- An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large-scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, given as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model [3.749193647980305]
This paper considers a Markov decision process (MDP) that admits a set of state-action features.
We show that a model-based approach (resp. $Q$-learning) provably learns an $\varepsilon$-optimal policy with high probability.
arXiv Detail & Related papers (2021-05-28T17:49:39Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z)
- Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
Given an MDP with $S$ states and $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we give an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)