Optimistic PAC Reinforcement Learning: the Instance-Dependent View
- URL: http://arxiv.org/abs/2207.05852v1
- Date: Tue, 12 Jul 2022 21:35:03 GMT
- Title: Optimistic PAC Reinforcement Learning: the Instance-Dependent View
- Authors: Andrea Tirinzoni, Aymen Al-Marjani, Emilie Kaufmann
- Abstract summary: We present an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available.
While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap.
In MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal.
- Score: 24.256960622176305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimistic algorithms have been extensively studied for regret minimization
in episodic tabular MDPs, both from a minimax and an instance-dependent view.
However, for the PAC RL problem, where the goal is to identify a near-optimal
policy with high probability, little is known about their instance-dependent
sample complexity. A negative result of Wagenmaker et al. (2021) suggests that
optimistic sampling rules cannot be used to attain the (still elusive) optimal
instance-dependent sample complexity. On the positive side, we provide the
first instance-dependent bound for an optimistic algorithm for PAC RL,
BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al.,
2021). While our bound features some minimal visitation probabilities, it also
features a refined notion of sub-optimality gap compared to the value gaps that
appear in prior work. Moreover, in MDPs with deterministic transitions, we show
that BPI-UCRL is actually near-optimal. On the technical side, our analysis is
very simple thanks to a new "target trick" of independent interest. We
complement these findings with a novel hardness result explaining why the
instance-dependent complexity of PAC RL cannot be easily related to that of
regret minimization, unlike in the minimax regime.
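
To make the structure of such optimistic PAC algorithms concrete, below is a minimal Python sketch in the spirit of BPI-UCRL on a toy tabular episodic MDP: upper and lower confidence Q-values are computed by backward induction with exploration bonuses, episodes are collected with the optimistic greedy policy, and the procedure stops once the confidence gap at the initial state drops below $\epsilon$. The toy MDP, the Hoeffding-style bonus, and all parameter choices are illustrative assumptions, not the exact quantities analyzed in the paper.

```python
import numpy as np

# A minimal sketch (not the paper's exact BPI-UCRL): optimistic best-policy
# identification in a toy tabular episodic MDP with Hoeffding-style bonuses.
rng = np.random.default_rng(0)

S, A, H = 3, 2, 3                 # toy sizes: states, actions, horizon (assumptions)
epsilon, delta = 1.0, 0.1         # PAC accuracy and confidence (assumptions)
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # hidden transition kernel
R_true = rng.uniform(size=(S, A))                 # hidden mean rewards in [0, 1]

counts = np.zeros((S, A))
P_hat = np.full((S, A, S), 1.0 / S)               # empirical transitions
R_hat = np.zeros((S, A))                          # empirical mean rewards

def confidence_q_values():
    """Backward induction with additive bonuses: returns (Q_upper, Q_lower)."""
    bonus = H * np.sqrt(np.log(2 * S * A * H / delta) / np.maximum(counts, 1))
    Qu = np.zeros((H + 1, S, A))
    Ql = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):
        Qu[h] = np.clip(R_hat + P_hat @ Qu[h + 1].max(axis=1) + bonus, 0, H)
        Ql[h] = np.clip(R_hat + P_hat @ Ql[h + 1].max(axis=1) - bonus, 0, H)
    return Qu, Ql

for episode in range(30_000):
    Qu, Ql = confidence_q_values()
    if Qu[0, 0].max() - Ql[0, 0].max() <= epsilon:    # stopping rule at the initial state
        break
    s = 0
    for h in range(H):                                # explore with the optimistic policy
        a = int(Qu[h, s].argmax())
        r = float(rng.random() < R_true[s, a])        # Bernoulli reward, mean R_true[s, a]
        s_next = int(rng.choice(S, p=P_true[s, a]))
        counts[s, a] += 1
        n = counts[s, a]
        R_hat[s, a] += (r - R_hat[s, a]) / n          # running mean of observed rewards
        P_hat[s, a] *= (n - 1) / n                    # running mean of next-state indicators
        P_hat[s, a, s_next] += 1.0 / n
        s = s_next

Qu, Ql = confidence_q_values()
policy = Ql[:H].argmax(axis=2)                        # recommend the greedy-pessimistic policy
gap = Qu[0, 0].max() - Ql[0, 0].max()
print(f"episodes: {episode + 1}, certified gap at s0: {gap:.3f} (target epsilon = {epsilon})")
```

In this simplified sketch, the recommended policy is greedy with respect to the pessimistic Q-values, so the printed gap upper-bounds its sub-optimality with high probability under the stated confidence bounds.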
Related papers
- Faster Stochastic Variance Reduction Methods for Compositional MiniMax
Optimization [50.10952609321302]
Compositional minimax optimization is a pivotal challenge across various machine learning domains.
Current methods of compositional minimax optimization are plagued by sub-optimal complexities or heavy reliance on sizable batch sizes.
This paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which can achieve the optimal sample complexity of $O(\kappa^3/\epsilon^3)$.
arXiv Detail & Related papers (2023-08-18T14:57:21Z) - Asymptotically Unbiased Instance-wise Regularized Partial AUC
Optimization: Theory and Algorithm [101.44676036551537]
One-way Partial AUC (OPAUC) and Two-way Partial AUC (TPAUC) measure the average performance of a binary classifier.
Most of the existing methods can only optimize PAUC approximately, leading to inevitable biases that are not controllable.
We present a simpler reformulation of the PAUC problem via distributionally robust optimization.
arXiv Detail & Related papers (2022-10-08T08:26:22Z) - Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via
Online Experiment Design [12.056495277232118]
This work seeks to understand the "instance-dependent" complexity of learning near-optimal policies.
We propose an algorithm, Pedel, whose sample complexity scales with a fine-grained instance-dependent measure of complexity.
We show that Pedel yields provable gains over low-regret, minimax-optimal algorithms.
arXiv Detail & Related papers (2022-07-06T10:42:57Z) - Near-Optimal Sample Complexity Bounds for Constrained MDPs [25.509556551558834]
We provide minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP.
We show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
arXiv Detail & Related papers (2022-06-13T15:58:14Z) - Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Beyond No Regret: Instance-Dependent PAC Reinforcement Learning [29.552894877883883]
We show that there exists a tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate.
We propose and analyze a novel, planning-based algorithm which attains this sample complexity.
We show that our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
arXiv Detail & Related papers (2021-08-05T16:34:17Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower bound for regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower bound is considerably smaller than in the general case and that it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.