Uniform-PAC Bounds for Reinforcement Learning with Linear Function
Approximation
- URL: http://arxiv.org/abs/2106.11612v1
- Date: Tue, 22 Jun 2021 08:48:56 GMT
- Title: Uniform-PAC Bounds for Reinforcement Learning with Linear Function
Approximation
- Authors: Jiafan He and Dongruo Zhou and Quanquan Gu
- Abstract summary: We study reinforcement learning with linear function approximation.
Existing algorithms have only high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees.
We propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability.
- Score: 92.3161051419884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning (RL) with linear function approximation.
Existing algorithms for this problem have only high-probability regret and/or
Probably Approximately Correct (PAC) sample complexity guarantees, neither of
which guarantees convergence to the optimal policy. In this paper, in order to
overcome the limitation of existing algorithms, we propose a new algorithm
called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with
high probability. The uniform-PAC guarantee is the strongest possible guarantee
for reinforcement learning in the literature, which can directly imply both PAC
and high probability regret bounds, making our algorithm superior to all
existing algorithms with linear function approximation. At the core of our
algorithm are a novel minimax value function estimator and a multi-level
partition scheme for selecting training samples from historical observations.
Both of these techniques are new and of independent interest.
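For context, the uniform-PAC criterion referenced above follows the standard definition from the uniform-PAC literature (Dann et al., 2017); the sketch below uses our own notation for episodes and policies and is not the paper's exact statement.

```latex
% Uniform-PAC (sketch of the standard definition; notation is ours):
% with probability at least 1 - \delta, simultaneously for every accuracy
% level \epsilon > 0, the number of episodes in which the policy \pi_k
% played in episode k is more than \epsilon-suboptimal is bounded as
\[
  N_\epsilon \;:=\; \sum_{k=1}^{\infty}
    \mathbf{1}\!\left\{ V^{*}(s_1^{k}) - V^{\pi_k}(s_1^{k}) > \epsilon \right\}
  \;\le\; F\!\left(\tfrac{1}{\epsilon}, \log\tfrac{1}{\delta}\right),
\]
% where F is polynomial in its arguments. Because the bound holds for all
% \epsilon at once, fixing a single \epsilon recovers a PAC sample-complexity
% bound, and summing the per-episode suboptimality gaps yields a
% high-probability regret bound, which is why uniform-PAC implies both.
```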
Related papers
- A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees [28.974797385513263]
We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs).
In this paper, we introduce a novel policy gradient PD algorithm with uniform probably approximately correct (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and sample complexity bounds for any target accuracy.
Notably, this represents the first Uniform-PAC algorithm for the online CMDP problem.
arXiv Detail & Related papers (2024-01-31T12:23:24Z)
- Improved High-Probability Bounds for the Temporal Difference Learning Algorithm via Exponential Stability [17.771354881467435]
We show that a simple algorithm with a universal and instance-independent step size is sufficient to obtain near-optimal variance and bias terms.
Our proof technique is based on refined error bounds for linear approximation together with the novel stability result for the product of random matrices.
arXiv Detail & Related papers (2023-10-22T12:37:25Z)
- Uniform-PAC Guarantees for Model-Based RL with Bounded Eluder Dimension [86.3584476711976]
We propose algorithms for both nonlinear bandits and model-based episodic RL using general function classes with bounded eluder dimension.
The achieved uniform-PAC sample complexity is tight in the sense that it matches the state-of-the-art regret bounds or sample complexity guarantees when reduced to the linear case.
arXiv Detail & Related papers (2023-05-15T05:07:45Z)
- A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser [6.2484576862659065]
This paper presents a new convergent Plug-and-Play (PnP) algorithm based on a relaxed proximal gradient descent scheme.
The algorithm converges for a wider range of regularization parameters, thus allowing more accurate restoration of an image.
arXiv Detail & Related papers (2023-01-31T16:11:47Z)
- PAC-Bayesian Learning of Optimization Algorithms [6.624726878647541]
We apply the PAC-Bayes theory to the setting of learning-to-optimize.
We learn optimization algorithms with provable generalization guarantees (PAC bounds) and an explicit trade-off between a high probability of convergence and a high convergence speed.
Our results rely on PAC-Bayes bounds for general, unbounded loss functions based on exponential families.
arXiv Detail & Related papers (2022-10-20T09:16:36Z)
- A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes whose complexity is independent of the number of states.
We give an efficient algorithm capable of improving the accuracy of weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, gives rise to the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.