Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds
for Episodic Reinforcement Learning
- URL: http://arxiv.org/abs/2107.01264v1
- Date: Fri, 2 Jul 2021 20:36:05 GMT
- Title: Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds
for Episodic Reinforcement Learning
- Authors: Christoph Dann, Teodor V. Marinov, Mehryar Mohri, Julian Zimmert
- Abstract summary: We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes.
We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs.
- Score: 50.44564503645015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide improved gap-dependent regret bounds for reinforcement learning in
finite episodic Markov decision processes. Compared to prior work, our bounds
depend on alternative definitions of gaps. These definitions are based on the
insight that, in order to achieve a favorable regret, an algorithm does not
need to learn how to behave optimally in states that are not reached by an
optimal policy. We prove tighter upper regret bounds for optimistic algorithms
and accompany them with new information-theoretic lower bounds for a large
class of MDPs. Our results show that optimistic algorithms cannot achieve the
information-theoretic lower bounds even in deterministic MDPs unless there is a
unique optimal policy.
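For context, prior gap-dependent analyses state their bounds in terms of the standard value-function gap. As a rough, illustrative sketch of that baseline (not the paper's new gap definitions; constants and horizon factors vary by analysis), for a tabular episodic MDP with horizon H:
\[
\operatorname{gap}_h(s,a) \;=\; V_h^\star(s) - Q_h^\star(s,a),
\qquad
\mathrm{Regret}(T) \;\lesssim\; \sum_{(s,a,h)\colon \operatorname{gap}_h(s,a)>0} \frac{H \log T}{\operatorname{gap}_h(s,a)} .
\]
The abstract's observation is that sums of this form also charge for states that no optimal policy reaches; the alternative gap definitions are built so that such states do not have to be learned to optimality.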
Related papers
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates; a minimal illustrative sketch of LCB-based selection appears after this list.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Cascaded Gaps: Towards Gap-Dependent Regret for Risk-Sensitive
Reinforcement Learning [14.036298712230233]
We study gap-dependent regret guarantees for risk-sensitive reinforcement learning based on the entropic risk measure (recalled in a short note after this list).
We derive non-asymptotic and logarithmic regret bounds for two model-free algorithms under episodic Markov decision processes.
arXiv Detail & Related papers (2022-03-07T03:07:09Z) - A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes, with complexity independent of the number of states.
We give an efficient boosting-style algorithm that improves the accuracy of weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z) - Regret Analysis in Deterministic Reinforcement Learning [78.31410227443102]
We study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms.
We present logarithmic problem-specific regret lower bounds that explicitly depend on the system parameters.
arXiv Detail & Related papers (2021-06-27T23:41:57Z) - Best-Case Lower Bounds in Online Learning [9.01310450044549]
Much of the work in online learning focuses on the study of sublinear upper bounds on the regret.
In this work, we initiate the study of best-case lower bounds in online convex optimization.
We show that the linearized version of Follow-the-Regularized-Leader (FTRL) can attain negative linear regret.
arXiv Detail & Related papers (2021-06-23T23:24:38Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for
Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs.
This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)
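Note on pessimistic selection (for the "Policy learning without overlap" entry above): a minimal sketch of ranking candidates by lower confidence bounds rather than point estimates. The function and the Hoeffding-style bonus below are illustrative assumptions, not the paper's actual PPL procedure.

import numpy as np

def lcb_select(point_estimates, counts, delta=0.05):
    """Pick the action whose lower confidence bound is highest (pessimistic selection).

    point_estimates: estimated value of each candidate action
    counts:          number of offline observations backing each estimate
    delta:           confidence level for the illustrative Hoeffding-style bonus
    """
    counts = np.maximum(counts, 1)                          # guard against zero counts
    width = np.sqrt(np.log(2.0 / delta) / (2.0 * counts))   # bonus shrinks as coverage grows
    lcb = point_estimates - width                           # penalize poorly covered actions
    return int(np.argmax(lcb))

# Action 1 has the higher point estimate but almost no data behind it,
# so the pessimistic rule prefers the well-covered action 0.
print(lcb_select(np.array([0.6, 0.8]), np.array([500, 2])))  # -> 0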
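Note on the entropic risk measure (for the "Cascaded Gaps" entry above): the standard definition for a return R with risk parameter \beta \neq 0 (the cited paper's exact parameterization may differ) is
\[
\rho_\beta(R) \;=\; \frac{1}{\beta} \log \mathbb{E}\!\left[ e^{\beta R} \right],
\]
which recovers \mathbb{E}[R] as \beta \to 0, and is risk-seeking for \beta > 0 and risk-averse for \beta < 0.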