Uniform-PAC Guarantees for Model-Based RL with Bounded Eluder Dimension
- URL: http://arxiv.org/abs/2305.08350v1
- Date: Mon, 15 May 2023 05:07:45 GMT
- Title: Uniform-PAC Guarantees for Model-Based RL with Bounded Eluder Dimension
- Authors: Yue Wu and Jiafan He and Quanquan Gu
- Abstract summary: We propose algorithms for both nonlinear bandits and model-based episodic RL using the general function class with a bounded eluder dimension.
The achieved uniform-PAC sample complexity is tight in the sense that it matches the state-of-the-art regret bounds or sample complexity guarantees when reduced to the linear case.
- Score: 86.3584476711976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been remarkable progress in reinforcement learning (RL)
with general function approximation. However, all these works only provide
regret or sample complexity guarantees. It remains an open question whether one can
achieve a stronger performance guarantee, i.e., the uniform probably approximately
correct (Uniform-PAC) guarantee, which implies both a sub-linear regret
bound and a polynomial sample complexity for any target learning accuracy. We
study this problem by proposing algorithms for both nonlinear bandits and
model-based episodic RL using the general function class with a bounded eluder
dimension. The key idea of the proposed algorithms is to assign each action to a
level according to its width with respect to the confidence set. The
achieved uniform-PAC sample complexity is tight in the sense that it matches
the state-of-the-art regret bounds or sample complexity guarantees when reduced
to the linear case. To the best of our knowledge, this is the first work to
establish uniform-PAC guarantees for bandits and RL beyond the linear case.
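To make the level scheme concrete, here is a minimal Python sketch of how an action's confidence-set width could be mapped to a dyadic level (the function names and the scalar-valued function class are illustrative assumptions, not the paper's actual implementation):

```python
import math

def width(confidence_set, action):
    """Width of the confidence set at an action: the largest
    disagreement between any two functions still deemed plausible."""
    values = [f(action) for f in confidence_set]
    return max(values) - min(values)

def assign_level(w, max_level=30):
    """Map a width w to the level l with 2**-l <= w < 2**-(l - 1),
    so better-understood actions (smaller width) land on deeper levels."""
    if w < 2.0 ** -max_level:  # below resolution: treat as fully learned
        return max_level
    return max(1, math.ceil(-math.log2(w)))
```

Roughly speaking, each level then maintains its own data and confidence set, which is what lets a single run be analyzed simultaneously at every target accuracy, the defining feature of a uniform-PAC guarantee.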
Related papers
- Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback [38.61232011566285]
We study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually.
In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees.
arXiv Detail & Related papers (2024-05-13T10:51:01Z) - Optimal PAC Bounds Without Uniform Convergence [11.125968799758436]
We provide optimal high probability risk bounds through a framework that surpasses the limitations of uniform convergence arguments.
Our framework converts the leave-one-out error of permutation invariant predictors into high probability risk bounds.
Specifically, our result shows that certain aggregations of one-inclusion graph algorithms are optimal.
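For reference, the leave-one-out error that the framework starts from is the average error of predictors each trained with one example held out; a minimal sketch (the `learn` interface is a hypothetical stand-in):

```python
def leave_one_out_error(learn, data):
    """Fraction of examples misclassified by a predictor trained on
    all the data except that example."""
    mistakes = 0
    for i, (x, y) in enumerate(data):
        predictor = learn(data[:i] + data[i + 1:])  # hold out point i
        mistakes += predictor(x) != y
    return mistakes / len(data)
```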
arXiv Detail & Related papers (2023-04-18T17:57:31Z) - Single-Trajectory Distributionally Robust Reinforcement Learning [21.955807398493334]
We propose Distributionally Robust RL (DRRL) to enhance performance across a range of environments.
Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory.
We design the first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ).
arXiv Detail & Related papers (2023-01-27T14:08:09Z) - Uniform-PAC Bounds for Reinforcement Learning with Linear Function
Approximation [92.3161051419884]
We study reinforcement learning with linear function approximation.
Existing algorithms only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees.
We propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability.
arXiv Detail & Related papers (2021-06-22T08:48:56Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
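The polylog(K) update count comes from updating lazily: replan only when enough new information has accumulated. Here is a minimal sketch in the linear special case (the `features` interface is hypothetical, and the paper's framework handles general function classes via a sub-sampling scheme rather than this covariance test), where information gain is the growth of the log-determinant of the feature covariance:

```python
import numpy as np

def count_lazy_updates(features, K, d, factor=2.0):
    """Replan only when the feature covariance determinant grows by a
    constant factor. log det(Sigma) can grow only to O(d log K), and
    each replan consumes a constant amount of that growth, so the
    number of replans is O(d log K) rather than Omega(K)."""
    sigma = np.eye(d)                        # regularized covariance
    _, last_logdet = np.linalg.slogdet(sigma)
    n_updates = 0
    for k in range(K):
        phi = features(k)                    # feature vector, shape (d,)
        sigma += np.outer(phi, phi)
        _, logdet = np.linalg.slogdet(sigma)
        if logdet > last_logdet + np.log(factor):
            n_updates += 1                   # large information gain: replan
            last_logdet = logdet
    return n_updates
```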
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve
Optimism, Embrace Virtual Curvature [61.22680308681648]
We show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward.
For both nonlinear bandits and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL).
arXiv Detail & Related papers (2021-02-08T12:41:56Z) - SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep
Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
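Both ingredients have short closed forms; a minimal NumPy sketch (the `q_ensemble` interface is a hypothetical stand-in for the paper's Q-networks, and the weight follows the sigmoid form reported in the paper):

```python
import numpy as np

def ucb_action(q_ensemble, state, actions, lam=1.0):
    """Ingredient (b): pick the action maximizing ensemble mean Q
    plus an uncertainty bonus (ensemble standard deviation)."""
    scores = []
    for a in actions:
        qs = np.array([q(state, a) for q in q_ensemble])
        scores.append(qs.mean() + lam * qs.std())
    return actions[int(np.argmax(scores))]

def target_weight(q_ensemble, next_state, next_action, temperature=10.0):
    """Ingredient (a): down-weight Bellman targets with high ensemble
    disagreement; sigmoid(-std * T) + 0.5 keeps weights in (0.5, 1.0]."""
    qs = np.array([q(next_state, next_action) for q in q_ensemble])
    return 1.0 / (1.0 + np.exp(qs.std() * temperature)) + 0.5
```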
arXiv Detail & Related papers (2020-07-09T17:08:44Z) - Breaking the Sample Size Barrier in Model-Based Reinforcement Learning
with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
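The plain plug-in approach is easy to sketch: estimate the transition kernel from generative-model draws, then run value iteration on the empirical MDP. A minimal version (the `sampler(s, a)` callable returning a sampled next state is a hypothetical stand-in):

```python
import numpy as np

def plug_in_planning(sampler, S, A, rewards, gamma, n_samples, n_iters):
    """Estimate transitions with n_samples draws per (s, a), then run
    value iteration on the empirical MDP. rewards has shape (S, A)."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_samples):
                P_hat[s, a, sampler(s, a)] += 1.0 / n_samples
    q = np.zeros((S, A))
    for _ in range(n_iters):
        v = q.max(axis=1)                  # greedy value estimate
        q = rewards + gamma * P_hat @ v    # Bellman optimality update
    return q
```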
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.