Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via
Online Experiment Design
- URL: http://arxiv.org/abs/2207.02575v2
- Date: Thu, 20 Jul 2023 12:59:44 GMT
- Title: Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via
Online Experiment Design
- Authors: Andrew Wagenmaker, Kevin Jamieson
- Abstract summary: This work seeks to understand the "instance-dependent" complexity of learning near-optimal policies.
We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity.
We show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms.
- Score: 12.056495277232118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While much progress has been made in understanding the minimax sample
complexity of reinforcement learning (RL) -- the complexity of learning on the
"worst-case" instance -- such measures of complexity often do not capture the
true difficulty of learning. In practice, on an "easy" instance, we might hope
to achieve a complexity far better than that achievable on the worst-case
instance. In this work we seek to understand the "instance-dependent"
complexity of learning near-optimal policies (PAC RL) in the setting of RL with
linear function approximation. We propose an algorithm, \textsc{Pedel}, which
achieves a fine-grained instance-dependent measure of complexity, the first of
its kind in the RL with function approximation setting, thereby capturing the
difficulty of learning on each particular problem instance. Through an explicit
example, we show that \textsc{Pedel} yields provable gains over low-regret,
minimax-optimal algorithms and that such algorithms are unable to hit the
instance-optimal rate. Our approach relies on a novel online experiment
design-based procedure which focuses the exploration budget on the "directions"
most relevant to learning a near-optimal policy, and may be of independent
interest.
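The abstract attributes the gains to an online experiment-design procedure that steers the exploration budget toward the feature "directions" most relevant to distinguishing near-optimal policies. The sketch below is not \textsc{Pedel}; it is a minimal, self-contained illustration of the underlying experiment-design primitive — a Frank-Wolfe approximation of a G-optimal design over a fixed set of feature vectors — with all function and variable names our own.

```python
import numpy as np

def g_optimal_design(features, n_iters=1000, reg=1e-6):
    """Frank-Wolfe approximation of a G-optimal design over feature directions.

    features: (n, d) array with one feature vector per candidate action/policy.
    Returns a probability vector lam such that sampling according to lam keeps
    the worst-case prediction variance max_i phi_i^T A(lam)^{-1} phi_i small,
    where A(lam) = sum_i lam_i phi_i phi_i^T.
    """
    n, d = features.shape
    lam = np.full(n, 1.0 / n)                 # start from the uniform design
    for t in range(n_iters):
        A = features.T @ (lam[:, None] * features) + reg * np.eye(d)
        A_inv = np.linalg.inv(A)
        # Prediction variance of every candidate direction under the current design.
        var = np.einsum("ij,jk,ik->i", features, A_inv, features)
        i_star = int(np.argmax(var))          # most uncertain direction
        step = 1.0 / (t + 2)                  # standard Frank-Wolfe step size
        lam = (1.0 - step) * lam
        lam[i_star] += step
    return lam

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = rng.normal(size=(50, 5))            # 50 hypothetical feature directions in R^5
    design = g_optimal_design(phi)
    print("design support size:", int(np.sum(design > 1e-3)))
```

In a linear MDP one would, roughly speaking, compute such designs over the feature vectors induced by candidate policies at each horizon step and collect episodes accordingly, rather than over a fixed finite set as in this toy example.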
Related papers
- Faster Stochastic Variance Reduction Methods for Compositional MiniMax
Optimization [50.10952609321302]
Compositional minimax optimization is a pivotal challenge across various machine learning domains.
Current methods of compositional minimax optimization are plagued by sub-optimal complexities or heavy reliance on sizable batch sizes.
This paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which can achieve the optimal sample complexity of $O(\kappa^3/\epsilon^3)$.
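For intuition, the recursive-momentum idea behind STORM-type estimators can be sketched in a few lines. This is a single-level sketch under our own naming, not the nested compositional estimator NSTORM actually uses.

```python
import numpy as np

def storm(grad_fn, sample_fn, x0, lr=0.05, a=0.1, n_steps=200, seed=0):
    """Single-level STORM-style loop with the recursive-momentum estimator
        d_t = grad(x_t; xi_t) + (1 - a) * (d_{t-1} - grad(x_{t-1}; xi_t)),
    where both gradients in the correction use the *same* fresh sample xi_t,
    which is what reduces variance without large batches.
    """
    rng = np.random.default_rng(seed)
    x_prev = np.asarray(x0, dtype=float)
    d = grad_fn(x_prev, sample_fn(rng))       # plain stochastic gradient to start
    x = x_prev - lr * d
    for _ in range(n_steps):
        xi = sample_fn(rng)                   # one fresh sample per iteration
        d = grad_fn(x, xi) + (1.0 - a) * (d - grad_fn(x_prev, xi))
        x_prev, x = x, x - lr * d
    return x
```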
arXiv Detail & Related papers (2023-08-18T14:57:21Z) - Optimistic PAC Reinforcement Learning: the Instance-Dependent View [24.256960622176305]
We present an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available.
While our bound involves some minimal visitation probabilities, it also features a refined notion of sub-optimality gap.
In MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal.
arXiv Detail & Related papers (2022-07-12T21:35:03Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
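The "two policies" in JSRL are a pre-existing guide policy and a learning exploration policy: the guide rolls in for the first part of each episode and the learner takes over, with the hand-over point moved earlier as training progresses. Below is a heavily simplified sketch of that roll-in curriculum (our own names; the schedule in the paper is gated on evaluated performance rather than linear, and the environment is assumed to expose the Gymnasium reset()/step() interface).

```python
def jsrl_style_episode(env, guide_policy, explore_policy, switch_step, max_steps=100):
    """One episode of a JSRL-style roll-in: the guide policy acts for the first
    `switch_step` steps, then the learning (exploration) policy takes over."""
    obs, _ = env.reset()
    total_reward, transitions = 0.0, []
    for t in range(max_steps):
        policy = guide_policy if t < switch_step else explore_policy
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs))
        total_reward += reward
        obs = next_obs
        if terminated or truncated:
            break
    return total_reward, transitions

def jsrl_style_training(env, guide_policy, explore_policy, update_fn,
                        horizon=100, n_stages=10, episodes_per_stage=50):
    """Hand control to the exploration policy earlier and earlier as training
    proceeds (linear schedule here purely for brevity)."""
    for stage in range(n_stages + 1):
        switch_step = horizon - stage * horizon // n_stages
        for _ in range(episodes_per_stage):
            _, transitions = jsrl_style_episode(env, guide_policy, explore_policy,
                                                switch_step, max_steps=horizon)
            update_fn(transitions)            # train the exploration policy on the data
```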
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z) - Beyond No Regret: Instance-Dependent PAC Reinforcement Learning [29.552894877883883]
We show that there exists a tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate.
We propose and analyze a novel, planning-based algorithm which attains this sample complexity.
We show that our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
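To make the contrast with worst-case rates concrete: minimax PAC bounds for tabular episodic MDPs scale roughly as $|S||A|\,\mathrm{poly}(H)/\epsilon^2$, whereas instance-dependent bounds of the kind studied here replace the uniform $1/\epsilon^2$ dependence with per-state-action gap terms. A stylized form, written only to illustrate the scaling (not the paper's exact statement; constants, horizon factors, and visitation terms omitted), is
$$\text{samples} \;\lesssim\; \sum_{h=1}^{H}\sum_{(s,a)} \frac{1}{\max\{\Delta_h(s,a),\,\epsilon\}^{2}}\cdot \mathrm{polylog},$$
where $\Delta_h(s,a)$ is the sub-optimality gap of taking action $a$ in state $s$ at step $h$; on instances with large gaps this can be far smaller than the worst-case rate.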
arXiv Detail & Related papers (2021-08-05T16:34:17Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with complexity-bounded function class, we show that the policy only needs to be updated for $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy for at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
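A common way to obtain such low switching in value-based or linear settings is to re-solve for a new policy only when the collected information has grown by a constant factor, e.g. when the determinant of a feature covariance matrix has doubled. The sketch below illustrates that generic trigger under our own naming; it is not the paper's specific sub-sampling procedure.

```python
import numpy as np

class LazyPolicySwitcher:
    """Trigger a policy update only when the collected information has grown enough.

    Tracks a regularized feature covariance A = reg*I + sum_t phi_t phi_t^T and
    fires when det(A) has at least doubled since the last policy solve.
    """

    def __init__(self, dim, reg=1.0, growth=2.0):
        self.A = reg * np.eye(dim)
        self.logdet_at_last_update = np.linalg.slogdet(self.A)[1]
        self.log_growth = np.log(growth)
        self.n_updates = 0

    def observe(self, phi):
        """Fold in the feature vector of one visited state-action pair."""
        phi = np.asarray(phi, dtype=float)
        self.A += np.outer(phi, phi)

    def should_update(self):
        """True iff det(A) grew by the target factor since the last update."""
        logdet = np.linalg.slogdet(self.A)[1]
        if logdet - self.logdet_at_last_update >= self.log_growth:
            self.logdet_at_last_update = logdet
            self.n_updates += 1
            return True
        return False
```

In a learning loop one would call `observe` on every visited feature vector and re-solve for a greedy policy only when `should_update` fires; with bounded features the determinant can double only $O(d \log K)$ times over $K$ episodes, which is where a polylogarithmic number of policy updates comes from.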
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Instance-Dependent Complexity of Contextual Bandits and Reinforcement
Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
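For the classical multi-armed bandit referenced here, the gap statement has a standard quantitative form: with arm means $\mu_a$, best mean $\mu^\star$, and gaps $\Delta_a = \mu^\star - \mu_a$, UCB-type algorithms enjoy the instance-dependent regret bound
$$\mathbb{E}[R_T] \;\le\; \sum_{a:\,\Delta_a>0} O\!\left(\frac{\log T}{\Delta_a}\right),$$
which is far below the minimax $\Theta(\sqrt{KT})$ rate when the gaps are large; the complexity measures introduced in this paper characterize when this kind of adaptivity carries over to contextual bandits and RL.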
arXiv Detail & Related papers (2020-10-07T01:33:06Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - On the Sample Complexity of Reinforcement Learning with Policy Space
Generalization [21.879621917722613]
We study the optimal sample complexity in large-scale Reinforcement Learning (RL) problems with policy space generalization.
Existing results show that without a generalization model, the sample complexity of an RL algorithm will inevitably depend on the cardinalities of state space and action space.
This paper proposes a new notion of eluder dimension for the policy space, which characterizes the intrinsic complexity of policy learning.
arXiv Detail & Related papers (2020-08-17T14:26:18Z) - Provably Efficient Exploration for Reinforcement Learning Using
Unsupervised Learning [96.78504087416654]
Motivated by the prevailing paradigm of using unsupervised learning for efficient exploration in reinforcement learning (RL) problems, we investigate when this paradigm is provably efficient.
We present a general algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret tabular RL algorithm.
arXiv Detail & Related papers (2020-03-15T19:23:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.