Beyond No Regret: Instance-Dependent PAC Reinforcement Learning
- URL: http://arxiv.org/abs/2108.02717v1
- Date: Thu, 5 Aug 2021 16:34:17 GMT
- Title: Beyond No Regret: Instance-Dependent PAC Reinforcement Learning
- Authors: Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson
- Abstract summary: We show that there exists a tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate.
We propose and analyze a novel, planning-based algorithm which attains this sample complexity.
We show that our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
- Score: 29.552894877883883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The theory of reinforcement learning has focused on two fundamental problems:
achieving low regret, and identifying $\epsilon$-optimal policies. While a
simple reduction allows one to apply a low-regret algorithm to obtain an
$\epsilon$-optimal policy and achieve the worst-case optimal rate, it is
unknown whether low-regret algorithms can obtain the instance-optimal rate for
policy identification. We show that this is not possible -- there exists a
fundamental tradeoff between achieving low regret and identifying an
$\epsilon$-optimal policy at the instance-optimal rate.
Motivated by our negative finding, we propose a new measure of
instance-dependent sample complexity for PAC tabular reinforcement learning
which explicitly accounts for the attainable state visitation distributions in
the underlying MDP. We then propose and analyze a novel, planning-based
algorithm which attains this sample complexity -- yielding a complexity which
scales with the suboptimality gaps and the "reachability" of a state. We show
that our algorithm is nearly minimax optimal, and on several examples that our
instance-dependent sample complexity offers significant improvements over
worst-case bounds.
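For intuition, the two rates being contrasted can be sketched schematically (this is a generic sketch, not the paper's exact statements; constants and log factors differ). The simple reduction runs a low-regret algorithm for $K$ episodes and outputs a policy drawn uniformly at random from those it played; if the regret is $\mathrm{Reg}(K)$, that policy is $\mathrm{Reg}(K)/K$-suboptimal in expectation, so a minimax regret bound of order $\sqrt{H^3 S A K}$ (with $S$ states, $A$ actions, horizon $H$) certifies an $\epsilon$-optimal policy once
\[
\frac{\mathrm{Reg}(K)}{K} \le \epsilon
\quad\Longleftrightarrow\quad
K \gtrsim \frac{H^3 S A}{\epsilon^2}.
\]
An instance-dependent PAC bound of the kind described above instead replaces the uniform $1/\epsilon^2$ dependence with per-state-action terms of the schematic form
\[
\sum_{s,a} \frac{\mathrm{poly}(H)}{\max\{\Delta(s,a),\, \epsilon\}^2},
\qquad
\Delta(s,a) = V^\star(s) - Q^\star(s,a),
\]
with each term weighted by how well the attainable state visitation distributions of the MDP can reach $(s,a)$, so that actions with large suboptimality gaps, and states with low reachability, contribute less than a worst-case count would suggest.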
Related papers
- Towards Instance-Optimality in Online PAC Reinforcement Learning [28.156332484814616]
We propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy.
We demonstrate that the sample complexity of the PEDEL algorithm of Wagenmaker and Jamieson (2022) closely approaches this lower bound.
arXiv Detail & Related papers (2023-10-31T19:26:36Z)
- Faster Stochastic Variance Reduction Methods for Compositional MiniMax Optimization [50.10952609321302]
Compositional minimax optimization is a pivotal challenge across various machine learning domains.
Current methods for compositional minimax optimization are plagued by suboptimal complexities or a heavy reliance on sizable batch sizes.
This paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which can achieve the optimal sample complexity of $O(\kappa^3/\epsilon^3)$; a sketch of the recursive-momentum idea appears after this list.
arXiv Detail & Related papers (2023-08-18T14:57:21Z)
- Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z)
- Optimistic PAC Reinforcement Learning: the Instance-Dependent View [24.256960622176305]
We present an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available.
While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap.
In MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal.
arXiv Detail & Related papers (2022-07-12T21:35:03Z)
- Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design [12.056495277232118]
This work seeks to understand the "instance-dependent" complexity of learning near-optimal policies.
We propose an algorithm, PEDEL, which achieves a fine-grained instance-dependent measure of complexity.
We show that PEDEL yields provable gains over low-regret, minimax-optimal algorithms.
arXiv Detail & Related papers (2022-07-06T10:42:57Z)
- Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity [58.70807593332932]
We study the oracle complexity of gradient-based methods for stochastic approximation problems.
We focus on instance-dependent complexity instead of worst-case complexity.
Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation.
arXiv Detail & Related papers (2020-06-08T09:25:47Z)
- The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime [52.38455827779212]
We propose a novel technique for analyzing adaptive sampling called the Simulator.
We prove the first instance-based lower bounds for the top-k problem that incorporate the appropriate log-factors.
Our new analysis inspires a simple and near-optimal algorithm for best-arm and top-k identification, the first practical algorithm of its kind for the latter problem.
arXiv Detail & Related papers (2017-02-16T23:42:02Z)
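A sketch of the recursive-momentum idea referenced in the NSTORM entry above (this is the generic single-level STORM estimator of Cutkosky and Orabona (2019), not NSTORM's exact update):
\[
d_t = \nabla f(x_t; \xi_t) + (1 - a_t)\,\bigl(d_{t-1} - \nabla f(x_{t-1}; \xi_t)\bigr),
\qquad
x_{t+1} = x_t - \eta\, d_t,
\]
where $\xi_t$ is a fresh stochastic sample, $a_t \in (0,1]$ is a momentum parameter, and the correction term $d_{t-1} - \nabla f(x_{t-1}; \xi_t)$ keeps the variance of the running gradient estimate small without requiring large batches. NSTORM nests estimators of this form to handle the compositional minimax structure; the exact updates and the $O(\kappa^3/\epsilon^3)$ analysis are given in the cited paper.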
This list is automatically generated from the titles and abstracts of the papers in this site.