Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via
Online Experiment Design
- URL: http://arxiv.org/abs/2207.02575v2
- Date: Thu, 20 Jul 2023 12:59:44 GMT
- Title: Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via
Online Experiment Design
- Authors: Andrew Wagenmaker, Kevin Jamieson
- Abstract summary: This work seeks to understand the "instance-dependent" complexity of learning near-optimal policies.
We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity.
We show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms.
- Score: 12.056495277232118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While much progress has been made in understanding the minimax sample
complexity of reinforcement learning (RL) -- the complexity of learning on the
"worst-case" instance -- such measures of complexity often do not capture the
true difficulty of learning. In practice, on an "easy" instance, we might hope
to achieve a complexity far better than that achievable on the worst-case
instance. In this work we seek to understand the "instance-dependent"
complexity of learning near-optimal policies (PAC RL) in the setting of RL with
linear function approximation. We propose an algorithm, \textsc{Pedel}, which
achieves a fine-grained instance-dependent measure of complexity, the first of
its kind in the RL with function approximation setting, thereby capturing the
difficulty of learning on each particular problem instance. Through an explicit
example, we show that \textsc{Pedel} yields provable gains over low-regret,
minimax-optimal algorithms and that such algorithms are unable to hit the
instance-optimal rate. Our approach relies on a novel online experiment
design-based procedure which focuses the exploration budget on the "directions"
most relevant to learning a near-optimal policy, and may be of independent
interest.
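The abstract attributes the gains to an online experiment-design procedure that steers the exploration budget toward the feature "directions" most relevant to distinguishing near-optimal policies. The sketch below is not \textsc{Pedel}; it is a minimal, self-contained illustration of the underlying experiment-design primitive — a Frank-Wolfe approximation of a G-optimal design over a fixed set of feature vectors — with all function and variable names our own.

```python
import numpy as np

def g_optimal_design(features, n_iters=1000, reg=1e-6):
    """Frank-Wolfe approximation of a G-optimal design over feature directions.

    features: (n, d) array with one feature vector per candidate action/policy.
    Returns a probability vector lam such that sampling according to lam keeps
    the worst-case prediction variance max_i phi_i^T A(lam)^{-1} phi_i small,
    where A(lam) = sum_i lam_i phi_i phi_i^T.
    """
    n, d = features.shape
    lam = np.full(n, 1.0 / n)                 # start from the uniform design
    for t in range(n_iters):
        A = features.T @ (lam[:, None] * features) + reg * np.eye(d)
        A_inv = np.linalg.inv(A)
        # Prediction variance of every candidate direction under the current design.
        var = np.einsum("ij,jk,ik->i", features, A_inv, features)
        i_star = int(np.argmax(var))          # most uncertain direction
        step = 1.0 / (t + 2)                  # standard Frank-Wolfe step size
        lam = (1.0 - step) * lam
        lam[i_star] += step
    return lam

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = rng.normal(size=(50, 5))            # 50 hypothetical feature directions in R^5
    design = g_optimal_design(phi)
    print("design support size:", int(np.sum(design > 1e-3)))
```

In a linear MDP one would, roughly speaking, compute such designs over the feature vectors induced by candidate policies at each horizon step and collect episodes accordingly, rather than over a fixed finite set as in this toy example.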
Related papers
- Faster Stochastic Variance Reduction Methods for Compositional MiniMax
Optimization [50.10952609321302]
Compositional minimax optimization is a pivotal challenge across various machine learning domains.
Current methods of compositional minimax optimization are plagued by sub-optimal complexities or heavy reliance on sizable batch sizes.
This paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which can achieve the optimal sample complexity of $O(\kappa^3/\epsilon^3)$.
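For intuition, the recursive-momentum idea behind STORM-type estimators can be sketched in a few lines. This is a single-level sketch under our own naming, not the nested compositional estimator NSTORM actually uses.

```python
import numpy as np

def storm(grad_fn, sample_fn, x0, lr=0.05, a=0.1, n_steps=200, seed=0):
    """Single-level STORM-style loop with the recursive-momentum estimator
        d_t = grad(x_t; xi_t) + (1 - a) * (d_{t-1} - grad(x_{t-1}; xi_t)),
    where both gradients in the correction use the *same* fresh sample xi_t,
    which is what reduces variance without large batches.
    """
    rng = np.random.default_rng(seed)
    x_prev = np.asarray(x0, dtype=float)
    d = grad_fn(x_prev, sample_fn(rng))       # plain stochastic gradient to start
    x = x_prev - lr * d
    for _ in range(n_steps):
        xi = sample_fn(rng)                   # one fresh sample per iteration
        d = grad_fn(x, xi) + (1.0 - a) * (d - grad_fn(x_prev, xi))
        x_prev, x = x, x - lr * d
    return x
```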
arXiv Detail & Related papers (2023-08-18T14:57:21Z) - Optimistic PAC Reinforcement Learning: the Instance-Dependent View [24.256960622176305]
We present an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available.
While our bound involves some minimal visitation probabilities, it also features a refined notion of sub-optimality gap.
In MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal.
arXiv Detail & Related papers (2022-07-12T21:35:03Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
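The "two policies" in JSRL are a pre-existing guide policy and a learning exploration policy: the guide rolls in for the first part of each episode and the learner takes over, with the hand-over point moved earlier as training progresses. Below is a heavily simplified sketch of that roll-in curriculum (our own names; the schedule in the paper is gated on evaluated performance rather than linear, and the environment is assumed to expose the Gymnasium reset()/step() interface).

```python
def jsrl_style_episode(env, guide_policy, explore_policy, switch_step, max_steps=100):
    """One episode of a JSRL-style roll-in: the guide policy acts for the first
    `switch_step` steps, then the learning (exploration) policy takes over."""
    obs, _ = env.reset()
    total_reward, transitions = 0.0, []
    for t in range(max_steps):
        policy = guide_policy if t < switch_step else explore_policy
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs))
        total_reward += reward
        obs = next_obs
        if terminated or truncated:
            break
    return total_reward, transitions

def jsrl_style_training(env, guide_policy, explore_policy, update_fn,
                        horizon=100, n_stages=10, episodes_per_stage=50):
    """Hand control to the exploration policy earlier and earlier as training
    proceeds (linear schedule here purely for brevity)."""
    for stage in range(n_stages + 1):
        switch_step = horizon - stage * horizon // n_stages
        for _ in range(episodes_per_stage):
            _, transitions = jsrl_style_episode(env, guide_policy, explore_policy,
                                                switch_step, max_steps=horizon)
            update_fn(transitions)            # train the exploration policy on the data
```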
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z) - Beyond No Regret: Instance-Dependent PAC Reinforcement Learning [29.552894877883883]
We show that there exists a tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate.
We propose and analyze a novel, planning-based algorithm which attains this sample complexity.
We show that our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
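To make the contrast with worst-case rates concrete: minimax PAC bounds for tabular episodic MDPs scale roughly as $|S||A|\,\mathrm{poly}(H)/\epsilon^2$, whereas instance-dependent bounds of the kind studied here replace the uniform $1/\epsilon^2$ dependence with per-state-action gap terms. A stylized form, written only to illustrate the scaling (not the paper's exact statement; constants, horizon factors, and visitation terms omitted), is
$$\text{samples} \;\lesssim\; \sum_{h=1}^{H}\sum_{(s,a)} \frac{1}{\max\{\Delta_h(s,a),\,\epsilon\}^{2}}\cdot \mathrm{polylog},$$
where $\Delta_h(s,a)$ is the sub-optimality gap of taking action $a$ in state $s$ at step $h$; on instances with large gaps this can be far smaller than the worst-case rate.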
arXiv Detail & Related papers (2021-08-05T16:34:17Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with complexity-bounded function class, we show that the policy only needs to be updated for $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy for at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
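A common way to obtain such low switching in value-based or linear settings is to re-solve for a new policy only when the collected information has grown by a constant factor, e.g. when the determinant of a feature covariance matrix has doubled. The sketch below illustrates that generic trigger under our own naming; it is not the paper's specific sub-sampling procedure.

```python
import numpy as np

class LazyPolicySwitcher:
    """Trigger a policy update only when the collected information has grown enough.

    Tracks a regularized feature covariance A = reg*I + sum_t phi_t phi_t^T and
    fires when det(A) has at least doubled since the last policy solve.
    """

    def __init__(self, dim, reg=1.0, growth=2.0):
        self.A = reg * np.eye(dim)
        self.logdet_at_last_update = np.linalg.slogdet(self.A)[1]
        self.log_growth = np.log(growth)
        self.n_updates = 0

    def observe(self, phi):
        """Fold in the feature vector of one visited state-action pair."""
        phi = np.asarray(phi, dtype=float)
        self.A += np.outer(phi, phi)

    def should_update(self):
        """True iff det(A) grew by the target factor since the last update."""
        logdet = np.linalg.slogdet(self.A)[1]
        if logdet - self.logdet_at_last_update >= self.log_growth:
            self.logdet_at_last_update = logdet
            self.n_updates += 1
            return True
        return False
```

In a learning loop one would call `observe` on every visited feature vector and re-solve for a greedy policy only when `should_update` fires; with bounded features the determinant can double only $O(d \log K)$ times over $K$ episodes, which is where a polylogarithmic number of policy updates comes from.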
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Instance-Dependent Complexity of Contextual Bandits and Reinforcement
Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
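For the classical multi-armed bandit referenced here, the gap statement has a standard quantitative form: with arm means $\mu_a$, best mean $\mu^\star$, and gaps $\Delta_a = \mu^\star - \mu_a$, UCB-type algorithms enjoy the instance-dependent regret bound
$$\mathbb{E}[R_T] \;\le\; \sum_{a:\,\Delta_a>0} O\!\left(\frac{\log T}{\Delta_a}\right),$$
which is far below the minimax $\Theta(\sqrt{KT})$ rate when the gaps are large; the complexity measures introduced in this paper characterize when this kind of adaptivity carries over to contextual bandits and RL.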
arXiv Detail & Related papers (2020-10-07T01:33:06Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - On the Sample Complexity of Reinforcement Learning with Policy Space
Generalization [21.879621917722613]
We study the optimal sample complexity in large-scale Reinforcement Learning (RL) problems with policy space generalization.
Existing results show that without a generalization model, the sample complexity of an RL algorithm will inevitably depend on the cardinalities of state space and action space.
This paper proposes a new notion of eluder dimension for the policy space, which characterizes the intrinsic complexity of policy learning.
arXiv Detail & Related papers (2020-08-17T14:26:18Z) - Provably Efficient Exploration for Reinforcement Learning Using
Unsupervised Learning [96.78504087416654]
Motivated by the prevailing paradigm of using unsupervised learning for efficient exploration in reinforcement learning (RL) problems, we investigate when this paradigm is provably efficient.
We present a general algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret tabular RL algorithm.
arXiv Detail & Related papers (2020-03-15T19:23:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.