Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies
- URL: http://arxiv.org/abs/2203.12922v1
- Date: Thu, 24 Mar 2022 08:14:12 GMT
- Title: Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies
- Authors: Zihan Zhang, Xiangyang Ji, Simon S. Du
- Abstract summary: We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
- Score: 88.75843804630772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper gives the first polynomial-time algorithm for tabular Markov
Decision Processes (MDP) that enjoys a regret bound \emph{independent of the
planning horizon}. Specifically, we consider a tabular MDP with $S$ states, $A$
actions, a planning horizon $H$, total reward bounded by $1$, and the agent
plays for $K$ episodes. We design an algorithm that achieves an
$O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to
existing bounds which either have an additional $\mathrm{polylog}(H)$
dependency~\citep{zhang2020reinforcement} or have an exponential dependency on
$S$~\citep{li2021settling}. Our result relies on a sequence of new structural
lemmas establishing the approximation power, stability, and concentration
property of stationary policies, which can have applications in other problems
related to Markov chains.
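For concreteness, here is the standard episodic-regret quantity the bound refers to (an illustrative restatement under the stated assumptions, not text from the paper). With $\pi^\star$ an optimal policy, $\pi_k$ the policy played in episode $k$, and $V^{\pi}$ the expected total reward of policy $\pi$ (at most $1$ by the bounded-reward assumption),
$$\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\left(V^{\pi^\star} - V^{\pi_k}\right) \;\le\; O\!\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right),$$
with no dependence on the horizon $H$. A stationary policy in this setting is a single map $\pi:\mathcal{S}\to\mathcal{A}$ applied at every step, as opposed to a non-stationary policy $(\pi_1,\dots,\pi_H)$ that may vary with the step index.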
Related papers
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm to achieve a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal
Algorithm Escaping the Curse of Horizon [88.75843804630772]
Episodic reinforcement learning generalizes contextual bandits.
A long planning horizon and unknown state-dependent transitions pose little additional difficulty in terms of sample complexity.
We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right)\mathrm{poly}\log(SAHK)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK} + S^2A\right)$ lower bound of \emph{contextual bandits} up to logarithmic terms.
arXiv Detail & Related papers (2020-09-28T17:52:32Z) - A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with
Constraints [8.849815837266977]
Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems.
We propose an online algorithm which leverages the linear programming formulation of the finite-horizon CMDP for repeated optimistic planning (a generic occupancy-measure LP of this form is sketched at the end of this list).
arXiv Detail & Related papers (2020-09-23T19:30:46Z) - Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints [38.2783003051101]
This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses a policy to maximize the total reward over a finite horizon while satisfying constraints at each epoch with probability 1.
We propose a model-free algorithm that converts the PCMDP problem to an unconstrained problem, to which a Q-learning based approach is applied.
arXiv Detail & Related papers (2020-03-11T23:23:29Z)
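As a concrete reference point for the "linear programming formulation of finite-horizon CMDP" mentioned in the constrained-MDP entry above, a generic occupancy-measure LP looks as follows. This is a standard textbook-style formulation written under assumed notation ($r$ for reward, $c$ for cost, $\tau$ for the constraint budget, $P$ for known transitions, $\mu$ for the initial state distribution); the algorithm in that paper plans optimistically with estimated quantities rather than the true ones.
$$\max_{q \ge 0} \;\sum_{h=1}^{H}\sum_{s,a} q_h(s,a)\,r_h(s,a) \quad \text{s.t.} \quad \sum_{h=1}^{H}\sum_{s,a} q_h(s,a)\,c_h(s,a) \le \tau,$$
$$\sum_{a} q_1(s,a) = \mu(s), \qquad \sum_{a} q_{h+1}(s',a) = \sum_{s,a} P_h(s'\mid s,a)\,q_h(s,a) \quad \text{for all } s',\; 1 \le h < H.$$
Any feasible $q$ is the state-action occupancy of some (possibly non-stationary) policy, which can be recovered as $\pi_h(a\mid s) \propto q_h(s,a)$.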