Sample-Efficient Reinforcement Learning Is Feasible for Linearly
Realizable MDPs with Limited Revisiting
- URL: http://arxiv.org/abs/2105.08024v1
- Date: Mon, 17 May 2021 17:22:07 GMT
- Title: Sample-Efficient Reinforcement Learning Is Feasible for Linearly
Realizable MDPs with Limited Revisiting
- Authors: Gen Li, Yuxin Chen, Yuejie Chi, Yuantao Gu, Yuting Wei
- Abstract summary: Low-complexity models such as linear function representation play a pivotal role in enabling sample-efficient reinforcement learning.
In this paper, we investigate a new sampling protocol, which draws samples in an online/exploratory fashion but allows one to backtrack and revisit previous states in a controlled and infrequent manner.
We develop an algorithm tailored to this setting, achieving a sample complexity that scales polynomially with the feature dimension, the horizon, and the inverse sub-optimality gap, but not the size of the state/action space.
- Score: 60.98700344526674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-complexity models such as linear function representation play a pivotal
role in enabling sample-efficient reinforcement learning (RL). The current
paper pertains to a scenario with value-based linear representation, which
postulates the linear realizability of the optimal Q-function (also called the
"linear $Q^{\star}$ problem"). While linear realizability alone does not allow
for sample-efficient solutions in general, the presence of a large
sub-optimality gap is a potential game changer, depending on the sampling
mechanism in use. Informally, sample efficiency is achievable with a large
sub-optimality gap when a generative model is available but is unfortunately
infeasible when we turn to standard online RL settings.
In this paper, we make progress towards understanding this linear $Q^{\star}$
problem by investigating a new sampling protocol, which draws samples in an
online/exploratory fashion but allows one to backtrack and revisit previous
states in a controlled and infrequent manner. This protocol is more flexible
than the standard online RL setting, while being practically relevant and far
more restrictive than the generative model. We develop an algorithm tailored to
this setting, achieving a sample complexity that scales polynomially with the
feature dimension, the horizon, and the inverse sub-optimality gap, but not the
size of the state/action space. Our findings underscore the fundamental
interplay between sampling protocols and low-complexity structural
representation in RL.
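Note: two points in the abstract can be made concrete. First, the "linear $Q^{\star}$" assumption is conventionally formalized (the notation here is standard in this literature, not quoted from the paper) as the existence of a known feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and unknown vectors $\theta_h^{\star} \in \mathbb{R}^d$ such that $Q_h^{\star}(s,a) = \langle \phi(s,a), \theta_h^{\star} \rangle$ for every state-action pair and every step $h = 1, \dots, H$. Second, the limited-revisiting sampling protocol can be pictured as an ordinary online interaction loop augmented with a small budget of backtracking calls. The Python sketch below illustrates only this reading; every class, method, and attribute name is hypothetical rather than taken from the paper.

    # Illustrative sketch of a "limited revisiting" sampling protocol as described in the
    # abstract above. All names are hypothetical; the paper's actual protocol may differ.

    class LimitedRevisitEnv:
        """Online episodic interaction plus a small budget of revisits to past states."""

        def __init__(self, env, revisit_budget):
            self.env = env                        # underlying episodic MDP simulator (assumed interface)
            self.revisit_budget = revisit_budget  # small budget, e.g. polynomial in d and H
            self.visited = []                     # states encountered so far

        def reset(self):
            # Standard online RL: start a fresh episode from the initial-state distribution.
            state = self.env.reset()
            self.visited.append(state)
            return state

        def step(self, action):
            # Standard online RL: advance the current episode by one transition.
            state, reward, done = self.env.step(action)
            self.visited.append(state)
            return state, reward, done

        def revisit(self, index):
            # The extra power beyond online RL: restart from a previously visited state.
            # This is "controlled and infrequent" because the budget is small, unlike a
            # generative model, which can be queried at arbitrary state-action pairs.
            if self.revisit_budget <= 0:
                raise RuntimeError("revisit budget exhausted")
            self.revisit_budget -= 1
            return self.env.set_state(self.visited[index])  # hypothetical simulator hook

Under this reading, a learner would mostly call reset/step exactly as in online RL and spend its revisit budget sparingly, which is what places the protocol strictly between standard online RL and the generative model.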
Related papers
- Offline RL via Feature-Occupancy Gradient Ascent [9.983014605039658]
We study offline reinforcement learning in large infinite-horizon discounted Markov Decision Processes (MDPs).
We develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies.
We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees.
arXiv Detail & Related papers (2024-05-22T15:39:05Z) - Sample Complexity of Kernel-Based Q-Learning [11.32718794195643]
We propose a non-parametric Q-learning algorithm which finds an $\epsilon$-optimal policy in an arbitrarily large-scale discounted MDP.
To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
arXiv Detail & Related papers (2023-02-01T19:46:25Z) - Non-Linear Reinforcement Learning in Large Action Spaces: Structural Conditions and Sample-efficiency of Posterior Sampling [28.371541697552928]
We present the first result for non-linear function approximation which holds for general action spaces under a linear embeddability condition.
We show worst case sample complexity guarantees that scale with a rank parameter of the RL problem.
arXiv Detail & Related papers (2022-03-15T20:50:26Z) - Sample Efficient Reinforcement Learning In Continuous State Spaces: A
Perspective Beyond Linearity [50.38337893712897]
We introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes no linearity assumptions.
We demonstrate that the EPW condition permits sample efficient RL, by providing an algorithm which provably solves MDPs satisfying this condition.
We additionally show the necessity of conditions like EPW, by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently.
arXiv Detail & Related papers (2021-06-15T00:06:59Z) - An Exponential Lower Bound for Linearly-Realizable MDPs with Constant
Suboptimality Gap [66.75488143823337]
We show that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed.
Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting.
arXiv Detail & Related papers (2021-03-23T17:05:54Z) - Bilinear Classes: A Structural Framework for Provable Generalization in
RL [119.42509700822484]
Bilinear Classes is a new structural framework which permits generalization in reinforcement learning.
The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable.
Our main result provides an RL algorithm which has polynomial sample complexity for Bilinear Classes.
arXiv Detail & Related papers (2021-03-19T16:34:20Z) - Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve
Optimism, Embrace Virtual Curvature [61.22680308681648]
We show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward.
For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL).
arXiv Detail & Related papers (2021-02-08T12:41:56Z) - Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement
Learning? [30.065091907118827]
This work considers the sample complexity of finding an $\epsilon$-optimal policy in a Markov decision process (MDP).
We solve this problem via a plug-in solver approach, which builds an empirical model and plans in this empirical model via an arbitrary plug-in solver.
We show that a plug-in approach can be sample efficient as well, providing a flexible approach to design model-based algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-10-12T13:13:01Z) - Learning the Linear Quadratic Regulator from Nonlinear Observations [135.66883119468707]
We introduce a new problem setting for continuous control called the LQR with Rich Observations, or RichLQR.
In our setting, the environment is summarized by a low-dimensional continuous latent state with linear dynamics and quadratic costs.
Our results constitute the first provable sample complexity guarantee for continuous control with an unknown nonlinearity in the system model and general function approximation.
arXiv Detail & Related papers (2020-10-08T07:02:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.