Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage
Decomposition
- URL: http://arxiv.org/abs/2004.10019v2
- Date: Sat, 6 Jun 2020 13:35:38 GMT
- Title: Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage
Decomposition
- Authors: Zihan Zhang, Yuan Zhou, Xiangyang Ji
- Abstract summary: We study the reinforcement learning problem in finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$.
We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret, where $T = KH$ and $K$ is the number of episodes to play.
- Score: 59.34067736545355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the reinforcement learning problem in the setting of finite-horizon
episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and
episode length $H$. We propose a model-free algorithm UCB-Advantage and prove
that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is
the number of episodes to play. Our regret bound improves upon the results of
[Jin et al., 2018] and matches the best known model-based algorithms as well as
the information theoretic lower bound up to logarithmic factors. We also show
that UCB-Advantage achieves low local switching cost and applies to concurrent
reinforcement learning, improving upon the recent results of [Bai et al.,
2019].
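The key algorithmic idea, the reference-advantage decomposition, can be pictured as a variance-reduced Q-learning update: the next-state value is split into a reference part that is eventually frozen and a small advantage part. The Python snippet below is a minimal sketch of that idea in a toy tabular setting; the learning rate, bonus constant, and reference-freezing rule are illustrative assumptions, not the paper's exact UCB-Advantage specification.

```python
import numpy as np

# Minimal, illustrative sketch of the reference-advantage decomposition:
# the next-state value V(s') is split into a reference part V_ref(s') plus a
# small advantage part V(s') - V_ref(s'). This is NOT the authors' exact
# UCB-Advantage algorithm (their bonus terms, stage schedule, and rule for
# freezing V_ref are more involved); all constants below are assumptions.

H, S, A = 5, 10, 2            # horizon, states, actions (toy sizes)
c_bonus = 1.0                 # stand-in exploration-bonus constant

Q = np.full((H + 1, S, A), float(H))   # optimistic Q estimates (Q[H] unused)
V = np.zeros((H + 1, S))               # running value estimates (V[H] = 0)
V_ref = np.zeros((H + 1, S))           # reference values, frozen once well estimated
N = np.zeros((H, S, A), dtype=int)     # visit counts
mu_ref = np.zeros((H, S, A))           # plain average of V_ref(s')  (low-variance part)
mu_adv = np.zeros((H, S, A))           # weighted average of V(s') - V_ref(s')  (advantage part)

def update(h, s, a, r, s_next):
    """One step-h Q-learning update using the reference-advantage split."""
    N[h, s, a] += 1
    n = N[h, s, a]
    alpha = (H + 1) / (H + n)   # learning rate of the Jin et al. (2018) form
    # Reference part: averaged over all visits, so its error shrinks like 1/sqrt(n).
    mu_ref[h, s, a] += (V_ref[h + 1, s_next] - mu_ref[h, s, a]) / n
    # Advantage part: recency-weighted, but small in magnitude once V_ref is accurate.
    adv = V[h + 1, s_next] - V_ref[h + 1, s_next]
    mu_adv[h, s, a] = (1 - alpha) * mu_adv[h, s, a] + alpha * adv
    bonus = c_bonus * np.sqrt(H ** 3 / n)            # simplified UCB bonus
    target = r + mu_ref[h, s, a] + mu_adv[h, s, a] + bonus
    Q[h, s, a] = min(Q[h, s, a], target)             # keep estimates monotone and optimistic
    V[h, s] = min(float(H), Q[h, s].max())
    # In UCB-Advantage, V_ref[h, s] is set to V[h, s] once s has been visited
    # sufficiently often at step h, and kept fixed afterwards.
```

The point of the split is that the low-variance reference term can be averaged over all visits of a state-action pair, while the recency-weighted term only needs to track the small advantage; this variance reduction is what allows the improvement over the $\sqrt{H^3SAT}$-type bound of [Jin et al., 2018].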
Related papers
- Optimistic Posterior Sampling for Reinforcement Learning with Few
Samples and Tight Guarantees [43.13918072870693]
We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL).
We guarantee a high-probability regret bound of order at most $\widetilde{\mathcal{O}}(\sqrt{H^3SAT})$, ignoring $\text{polylog}(HSAT)$ terms.
Our bound matches the lower bound of order $\Omega(\sqrt{H^3SAT})$, thereby answering the open problems raised by Agrawal and Jia.
arXiv Detail & Related papers (2022-09-28T20:49:34Z) - The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret
and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises (a minimal sketch of this idea appears after this list).
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state with minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S||A|K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs
with Near-optimal Regret [44.374427255708135]
We propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs).
EE-QL assumes that an online concentrating approximation of the optimal average reward is available.
This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodic assumption, and matches the lower bound in terms of $T$ except for logarithmic factors.
arXiv Detail & Related papers (2020-06-08T05:09:32Z)
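The randomized-exploration entry above drives exploration purely by perturbing the training data; the sketch below illustrates that idea for least-squares value iteration with linear features. The feature map `phi`, the per-step transition buffers `data`, the noise scale `sigma`, and the ridge weight `lam` are assumptions made for this sketch, not the paper's exact construction.

```python
import numpy as np

# Illustrative sketch of exploration via perturbed training data (RLSVI-style).
# `phi`, `data`, `sigma`, and `lam` are assumptions for this sketch.

def perturbed_lsvi(data, phi, H, A, sigma=1.0, lam=1.0):
    """data[h]: list of (s, a, r, s_next) transitions observed at step h;
    phi(s, a): d-dimensional feature vector. Returns per-step weight vectors."""
    d = phi(*data[0][0][:2]).shape[0]
    theta = [np.zeros(d) for _ in range(H + 1)]        # theta[H] = 0 (terminal values)
    for h in reversed(range(H)):
        X, y = [], []
        for (s, a, r, s_next) in data[h]:
            v_next = max(phi(s_next, b) @ theta[h + 1] for b in range(A))
            noise = sigma * np.random.randn()           # fresh i.i.d. scalar perturbation
            X.append(phi(s, a))
            y.append(r + noise + v_next)                # perturbed regression target
        X, y = np.asarray(X), np.asarray(y)
        # Ridge regression on perturbed targets: the injected noise is what
        # randomizes the resulting Q-estimate and drives exploration.
        theta[h] = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return theta  # act greedily w.r.t. Q_h(s, a) = phi(s, a) @ theta[h]
```

Re-solving the regression with freshly perturbed targets each episode and acting greedily plays a role similar to posterior sampling, without maintaining explicit confidence sets.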