Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon
Reinforcement Learning?
- URL: http://arxiv.org/abs/2005.00527v2
- Date: Thu, 9 Jul 2020 16:20:06 GMT
- Title: Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon
Reinforcement Learning?
- Authors: Ruosong Wang, Simon S. Du, Lin F. Yang, Sham M. Kakade
- Abstract summary: Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems.
We show that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense.
- Score: 108.94173231481355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to plan for long horizons is a central challenge in episodic
reinforcement learning problems. A fundamental question is to understand how
the difficulty of the problem scales as the horizon increases. Here the natural
measure of sample complexity is a normalized one: we are interested in the
number of episodes it takes to provably discover a policy whose value is
$\varepsilon$-near to the optimal value, where the value is measured by
the normalized cumulative reward in each episode. In a COLT 2018 open problem,
Jiang and Agarwal conjectured that, for tabular, episodic reinforcement
learning problems, there exists a sample complexity lower bound which exhibits
a polynomial dependence on the horizon -- a conjecture which is consistent with
all known sample complexity upper bounds. This work refutes this conjecture,
proving that tabular, episodic reinforcement learning is possible with a sample
complexity that scales only logarithmically with the planning horizon. In other
words, when the values are appropriately normalized (to lie in the unit
interval), this result shows that long horizon RL is no more difficult than
short horizon RL, at least in a minimax sense. Our analysis introduces two
ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose
log-covering number scales only logarithmically with the planning horizon, and
(ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all
policies in a given policy class using sample complexity that scales with the
log-covering number of the given policy class. Both may be of independent
interest.
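For concreteness, the normalized PAC objective described in the abstract can be written out as follows. This is an illustrative sketch: the exact normalization convention used here (per-step rewards $r_h \in [0,1]$ with the episodic return divided by the horizon $H$, so that values lie in the unit interval) is an assumption for exposition, not a quotation of the paper's setup.
\[
V^{\pi} = \mathbb{E}_{\pi}\!\left[\frac{1}{H}\sum_{h=1}^{H} r_h\right] \in [0,1],
\qquad
V^{\star} = \max_{\pi} V^{\pi}.
\]
An algorithm is $(\varepsilon,\delta)$-PAC if, after $N$ episodes of interaction, it outputs a policy $\hat{\pi}$ satisfying
\[
V^{\hat{\pi}} \;\ge\; V^{\star} - \varepsilon \quad \text{with probability at least } 1-\delta.
\]
Under such a normalization, the paper's claim is that $N$ need only scale logarithmically with $H$ (and polynomially with the remaining problem parameters), whereas the Jiang-Agarwal conjecture posited an unavoidable polynomial dependence on $H$.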
Related papers
- Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement
Learning with General Function Approximation [26.277745106128197]
We propose an algorithm to tackle long planning horizon problems in reinforcement learning with general function approximation.
The derived regret bound is deemed sharp as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors.
The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs and (ii) fine-grained analyses.
arXiv Detail & Related papers (2023-12-07T17:35:34Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Lower Bounds for Learning in Revealing POMDPs [88.23337313766355]
This paper studies the fundamental limits of reinforcement learning (RL) in the challenging partially observable setting.
For multi-step revealing POMDPs, we show that the latent state-space dependence is at least $\Omega(S^{1.5})$ in the sample complexity.
arXiv Detail & Related papers (2023-02-02T18:59:30Z) - Provable Reinforcement Learning with a Short-Term Memory [68.00677878812908]
We study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$.
In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short memory length $m$ rather than the problem horizon.
Our results show that a short-term memory suffices for reinforcement learning in these environments.
arXiv Detail & Related papers (2022-02-08T16:39:57Z) - Settling the Horizon-Dependence of Sample Complexity in Reinforcement
Learning [82.31436758872715]
We develop an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions.
We establish a connection between value functions in discounted and finite-horizon Markov decision processes.
arXiv Detail & Related papers (2021-11-01T00:21:24Z) - Online Sparse Reinforcement Learning [60.44832065993122]
We investigate the hardness of online reinforcement learning in fixed-horizon, sparse linear Markov decision processes (MDPs).
We show that linear regret is generally unavoidable in this case, even if there exists a policy that collects well-conditioned data.
This shows that in the large-action setting, the difficulty of learning can be attributed to the difficulty of finding a good exploratory policy.
arXiv Detail & Related papers (2020-11-08T16:47:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.