Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal
Algorithm Escaping the Curse of Horizon
- URL: http://arxiv.org/abs/2009.13503v2
- Date: Wed, 30 Jun 2021 07:06:10 GMT
- Title: Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal
Algorithm Escaping the Curse of Horizon
- Authors: Zihan Zhang, Xiangyang Ji, Simon S. Du
- Abstract summary: Episodic reinforcement learning generalizes contextual bandits.
Long planning horizon and unknown state-dependent transitions pose little additional difficulty in terms of sample complexity.
We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \poly\log\left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of \emph{contextual bandits} up to logarithmic terms.
- Score: 88.75843804630772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Episodic reinforcement learning and contextual bandits are two widely studied
sequential decision-making problems. Episodic reinforcement learning
generalizes contextual bandits and is often perceived to be more difficult due
to long planning horizon and unknown state-dependent transitions. The current
paper shows that the long planning horizon and the unknown state-dependent
transitions pose, at most, little additional difficulty in terms of sample complexity.
We consider episodic reinforcement learning with $S$ states, $A$ actions,
planning horizon $H$, and total reward bounded by $1$, where the agent plays for $K$
episodes. We propose a new algorithm, \textbf{M}onotonic \textbf{V}alue
\textbf{P}ropagation (MVP), which relies on a new Bernstein-type bonus.
Compared to existing bonus constructions, the new bonus is tighter since it is
based on a well-designed monotonic value function. In particular, the
\emph{constants} in the bonus must be set carefully to ensure optimism and
monotonicity.
We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \poly\log
\left(SAHK\right)\right)$ regret, approaching the
$\Omega\left(\sqrt{SAK}\right)$ lower bound of \emph{contextual bandits} up to
logarithmic terms. Notably, this result 1) \emph{exponentially} improves the
state-of-the-art polynomial-time algorithms by Dann et al. [2019] and Zanette
et al. [2019] in terms of the dependency on $H$, and 2) \emph{exponentially}
improves the running time in [Wang et al. 2020] and significantly improves the
dependency on $S$, $A$ and $K$ in sample complexity.
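The abstract describes MVP only at a high level, so the following minimal sketch, assuming a tabular empirical model with visit counts, illustrates the general recipe it builds on: optimistic backward value iteration on the empirical MDP with a Bernstein-type (variance-aware) bonus, clipped so the optimistic value respects the total-reward bound of $1$. The function names, the placeholder constants `c1` and `c2`, and the simple clipping below are illustrative assumptions; the paper's actual bonus constants and monotone value construction are chosen more delicately than this sketch shows.

```python
import numpy as np

def bernstein_bonus(var, n, c1=1.0, c2=1.0, delta=0.05):
    """Generic Bernstein-style bonus: a variance-dependent sqrt(Var/n) term
    plus a lower-order 1/n term. c1 and c2 are placeholder constants; MVP
    tunes such constants so that optimism and monotonicity both hold."""
    log_term = np.log(1.0 / delta)
    n = np.maximum(n, 1)  # avoid division by zero at unvisited (s, a) pairs
    return c1 * np.sqrt(var * log_term / n) + c2 * log_term / n

def optimistic_value_iteration(P_hat, R, n, H):
    """Optimistic backward value iteration on an empirical tabular MDP.

    P_hat : (S, A, S) empirical transition probabilities
    R     : (S, A) rewards, with total reward per episode bounded by 1
    n     : (S, A) visit counts
    H     : planning horizon
    Returns optimistic Q-values of shape (H, S, A).
    """
    S, A, _ = P_hat.shape
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        ev = P_hat @ V[h + 1]                 # E[V_{h+1}] under the empirical model, shape (S, A)
        ev2 = P_hat @ (V[h + 1] ** 2)         # E[V_{h+1}^2], shape (S, A)
        var = np.maximum(ev2 - ev ** 2, 0.0)  # empirical variance of the next-step value
        bonus = bernstein_bonus(var, n)
        # Optimism plus clipping: the value estimate stays within the bounded-reward range [0, 1].
        Q[h] = np.minimum(R + ev + bonus, 1.0)
        V[h] = Q[h].max(axis=1)
    return Q[:H]
```

The point the abstract emphasizes is that the variance term makes this kind of bonus tighter than a Hoeffding-style bonus, while the carefully tuned constants keep the resulting value function both optimistic and monotone, which is what drives the horizon-free regret bound.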
Related papers
- Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2|\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Uncoupled Learning Dynamics with $O(\log T)$ Swap Regret in Multiplayer
Games [121.50979258049135]
We show that when all players follow our dynamics with a \emph{time-invariant} learning rate, the \emph{second-order} path lengths of the dynamics up to time $T$ are bounded by $O(\log T)$.
Our proposed learning dynamics combine in a novel way \emph{optimistic} regularized learning with the use of \emph{self-concordant} barriers.
arXiv Detail & Related papers (2022-04-25T03:20:16Z) - Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies [88.75843804630772]
We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
arXiv Detail & Related papers (2022-03-24T08:14:12Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, \textbf{S}taged \textbf{S}ampling + \textbf{T}runcated \textbf{P}lanning, which interacts with the environment for at most $O\left(\frac{S^2A}{\epsilon^2}\,\text{poly}\log\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - Improved Optimistic Algorithms for Logistic Bandits [16.140301473601454]
We propose a new optimistic algorithm based on a finer examination of the non-linearities of the reward function.
We show that it enjoys a $\tilde{\mathcal{O}}(\sqrt{T})$ regret with no dependency on $\kappa$, except in a second-order term.
arXiv Detail & Related papers (2020-02-18T12:52:32Z)