The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret
and Policy Switches
- URL: http://arxiv.org/abs/2203.01491v1
- Date: Thu, 3 Mar 2022 02:55:55 GMT
- Title: The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret
and Policy Switches
- Authors: Grigoris Velegkas, Zhuoran Yang, Amin Karbasi
- Abstract summary: We study the problem of regret minimization for episodic Reinforcement Learning (RL)
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
- Score: 84.54669549718075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the problem of regret minimization for episodic
Reinforcement Learning (RL) both in the model-free and the model-based setting.
We focus on learning with general function classes and general model classes,
and we derive results that scale with the eluder dimension of these classes. In
contrast to the existing body of work that mainly establishes
instance-independent regret guarantees, we focus on the instance-dependent
setting and show that the regret scales logarithmically with the horizon $T$,
provided that there is a gap between the best and the second best action in
every state. In addition, we show that such a logarithmic regret bound is
realizable by algorithms with $O(\log T)$ switching cost (also known as
adaptivity complexity). In other words, these algorithms rarely switch their
policy during the course of their execution. Finally, we complement our results
with lower bounds which show that even in the tabular setting, we cannot hope
for regret guarantees lower than $o(\log T)$.
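To make the switching-cost claim concrete, an $O(\log T)$ switching cost is the kind of guarantee produced by a doubling-style update schedule, in which the learner re-plans its policy only when the amount of collected data doubles. The following is a minimal sketch of such a schedule, assumed for illustration rather than taken from the paper; `plan_policy` and `run_episode` are hypothetical placeholders.
```python
# Minimal sketch (assumed, not the paper's algorithm): a doubling schedule
# that re-plans the policy only when the episode count doubles, so the
# number of policy switches over T episodes is at most log2(T) + 1 = O(log T).
import math

def run_with_low_switching(T, plan_policy, run_episode):
    """plan_policy(history) -> policy and run_episode(policy) -> trajectory
    are hypothetical placeholders for any planning/execution step."""
    history = []
    policy = plan_policy(history)      # initial policy from an empty dataset
    switches, next_update = 0, 1
    for k in range(1, T + 1):
        history.append(run_episode(policy))
        if k >= next_update:           # re-plan only at episodes 1, 2, 4, 8, ...
            policy = plan_policy(history)
            switches += 1
            next_update *= 2
    assert switches <= math.floor(math.log2(T)) + 1   # O(log T) switches
    return policy, switches
```
The paper's algorithms presumably pair a low-switching schedule of this kind with optimistic estimation over the given function or model class, which is where the eluder dimension enters the regret bound; the sketch only illustrates why the switch count is logarithmic.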
Related papers
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Generalized Bandit Regret Minimizer Framework in Imperfect Information
Extensive-Form Game [9.933208900617174]
We consider the problem in the interactive bandit-feedback setting where we do not know the dynamics of the IIEG.
To learn NE, the regret minimizer is required to estimate the full-feedback loss gradient $\ell^t$ by $v(z^t)$ and minimize the regret.
We present a novel method, SIX-OMD, to learn an approximate NE. It is model-free and substantially improves the best existing convergence rate from the order of $O(\sqrt{XB/T}+\sqrt{YC/T})$ to $O(
arXiv Detail & Related papers (2022-03-11T13:45:42Z) - First-Order Regret in Reinforcement Learning with Linear Function
Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z) - Near-Optimal No-Regret Learning for Correlated Equilibria in
Multi-Player General-Sum Games [104.74734408204749]
We show that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game.
We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium (a toy sketch of the OMWU update appears after this list).
arXiv Detail & Related papers (2021-11-11T01:19:53Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage
Decomposition [59.34067736545355]
We study the reinforcement learning problem in finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$.
We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret, where $T = KH$ and $K$ is the number of episodes to play.
arXiv Detail & Related papers (2020-04-21T14:00:06Z)
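For the OMWU entry above, the uncoupled dynamics in question are optimistic Hedge updates run in self-play. Below is a toy, hedged sketch for a two-player general-sum normal-form game; the payoff matrices `A` and `B`, the step size `eta`, and the horizon `T` are made-up illustrative inputs, and the sketch omits the polylog-regret analysis entirely.
```python
# Toy sketch of Optimistic Multiplicative Weights Update (optimistic Hedge)
# in self-play for a two-player general-sum normal-form game. Illustrative
# only; not the referenced paper's implementation or analysis.
import numpy as np

def omwu_selfplay(A, B, T=1000, eta=0.1):
    """A[i, j]: payoff to player 1, B[i, j]: payoff to player 2 when
    player 1 plays action i and player 2 plays action j."""
    n, m = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m          # uniform initial mixtures
    prev_lx, prev_ly = np.zeros(n), np.zeros(m)    # last round's loss vectors
    for _ in range(T):
        lx = -(A @ y)        # player 1's expected loss per action
        ly = -(B.T @ x)      # player 2's expected loss per action
        # optimistic step: penalize 2*(current loss) - (previous loss),
        # i.e. use the latest loss as a prediction of the next one
        x = x * np.exp(-eta * (2 * lx - prev_lx)); x /= x.sum()
        y = y * np.exp(-eta * (2 * ly - prev_ly)); y /= y.sum()
        prev_lx, prev_ly = lx, ly
    return x, y
```
The optimistic correction $2\ell^t - \ell^{t-1}$ is what distinguishes OMWU from plain multiplicative weights and is the ingredient behind the $O(\textrm{polylog}(T))$ regret cited above.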