Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies
- URL: http://arxiv.org/abs/2203.12922v1
- Date: Thu, 24 Mar 2022 08:14:12 GMT
- Title: Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies
- Authors: Zihan Zhang, Xiangyang Ji, Simon S. Du
- Abstract summary: We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
- Score: 88.75843804630772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper gives the first polynomial-time algorithm for tabular Markov
Decision Processes (MDP) that enjoys a regret bound \emph{independent of the
planning horizon}. Specifically, we consider a tabular MDP with $S$ states, $A$
actions, a planning horizon $H$, total reward bounded by $1$, and the agent
plays for $K$ episodes. We design an algorithm that achieves an
$O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to
existing bounds which either have an additional $\mathrm{polylog}(H)$
dependency~\citep{zhang2020reinforcement} or have an exponential dependency on
$S$~\citep{li2021settling}. Our result relies on a sequence of new structural
lemmas establishing the approximation power, stability, and concentration
property of stationary policies, which can have applications in other problems
related to Markov chains.
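For concreteness, here is the standard episodic-regret quantity the bound refers to (an illustrative restatement under the stated assumptions, not text from the paper). With $\pi^\star$ an optimal policy, $\pi_k$ the policy played in episode $k$, and $V^{\pi}$ the expected total reward of policy $\pi$ (at most $1$ by the bounded-reward assumption),
$$\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\left(V^{\pi^\star} - V^{\pi_k}\right) \;\le\; O\!\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right),$$
with no dependence on the horizon $H$. A stationary policy in this setting is a single map $\pi:\mathcal{S}\to\mathcal{A}$ applied at every step, as opposed to a non-stationary policy $(\pi_1,\dots,\pi_H)$ that may vary with the step index.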
Related papers
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm to achieve a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal
Algorithm Escaping the Curse of Horizon [88.75843804630772]
Episodic reinforcement learning generalizes contextual bandits.
A long planning horizon and unknown state-dependent transitions pose little additional difficulty in terms of sample complexity.
We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right)\mathrm{poly}\log(SAHK)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK} + S^2A\right)$ lower bound of \emph{contextual bandits} up to logarithmic terms.
arXiv Detail & Related papers (2020-09-28T17:52:32Z) - A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with
Constraints [8.849815837266977]
Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems.
We propose an online algorithm which leverages the linear programming formulation of the finite-horizon CMDP for repeated optimistic planning (a generic occupancy-measure LP of this form is sketched at the end of this list).
arXiv Detail & Related papers (2020-09-23T19:30:46Z) - Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints [38.2783003051101]
This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses a policy to maximize the total reward over a finite horizon while satisfying constraints at each epoch with probability 1.
We propose a model-free algorithm that converts the PCMDP problem to an unconstrained problem, to which a Q-learning based approach is applied.
arXiv Detail & Related papers (2020-03-11T23:23:29Z)
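As a concrete reference point for the "linear programming formulation of finite-horizon CMDP" mentioned in the constrained-MDP entry above, a generic occupancy-measure LP looks as follows. This is a standard textbook-style formulation written under assumed notation ($r$ for reward, $c$ for cost, $\tau$ for the constraint budget, $P$ for known transitions, $\mu$ for the initial state distribution); the algorithm in that paper plans optimistically with estimated quantities rather than the true ones.
$$\max_{q \ge 0} \;\sum_{h=1}^{H}\sum_{s,a} q_h(s,a)\,r_h(s,a) \quad \text{s.t.} \quad \sum_{h=1}^{H}\sum_{s,a} q_h(s,a)\,c_h(s,a) \le \tau,$$
$$\sum_{a} q_1(s,a) = \mu(s), \qquad \sum_{a} q_{h+1}(s',a) = \sum_{s,a} P_h(s'\mid s,a)\,q_h(s,a) \quad \text{for all } s',\; 1 \le h < H.$$
Any feasible $q$ is the state-action occupancy of some (possibly non-stationary) policy, which can be recovered as $\pi_h(a\mid s) \propto q_h(s,a)$.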