Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time
- URL: http://arxiv.org/abs/2305.15546v2
- Date: Tue, 12 Dec 2023 05:40:30 GMT
- Title: Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time
- Authors: Xiang Ji, Gen Li
- Abstract summary: We introduce a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner.
This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.
- Score: 13.545356254920584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A crucial problem in reinforcement learning is learning the optimal policy.
We study this in tabular infinite-horizon discounted Markov decision processes
under the online setting. The existing algorithms either fail to achieve regret
optimality or have to incur a high memory and computational cost. In addition,
existing optimal algorithms all require a long burn-in time in order to achieve
optimal sample efficiency, i.e., their optimality is not guaranteed unless the
sample size surpasses a high threshold. We address both open problems by
introducing a model-free algorithm that employs variance reduction and a novel
technique that switches the execution policy in a slow-yet-adaptive manner.
This is the first regret-optimal model-free algorithm in the discounted
setting, with the additional benefit of a low burn-in time.
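To make the two ingredients concrete, here is a minimal sketch in the spirit of the abstract, not the authors' exact algorithm: the bootstrap target is split into a slowly-settled reference part (a plain running average whose variance shrinks while the reference is frozen) plus a small advantage correction, and the executed policy at a state is switched only when its visit count doubles. The `env` interface, the step-size schedule, and the doubling rule are illustrative assumptions.

```python
import numpy as np

def train(env, S, A, gamma=0.99, T=100_000):
    H = 1.0 / (1.0 - gamma)                   # effective horizon
    Q = np.full((S, A), H)                    # optimistic initialization
    V_ref = np.full(S, H)                     # slowly-settled reference values
    ref_sum = np.zeros((S, A))                # running sum of V_ref(s') samples
    adv = np.zeros((S, A))                    # estimate of the advantage V - V_ref
    N = np.zeros((S, A), dtype=np.int64)      # visit counts
    policy = np.zeros(S, dtype=np.int64)      # executed policy, switched rarely
    s = env.reset()
    for _ in range(T):
        a = policy[s]
        s_next, r, done = env.step(a)         # assumed interface: (s', reward, done)
        N[s, a] += 1
        # Reference part: a plain running average, whose variance shrinks
        # because V_ref is held fixed between rare reference updates.
        ref_sum[s, a] += V_ref[s_next]
        # Advantage part: a small correction learned with the usual
        # (H + 1) / (H + n) Q-learning step size.
        lr = (H + 1.0) / (H + N[s, a])
        V = Q.max(axis=1)
        adv[s, a] += lr * ((V[s_next] - V_ref[s_next]) - adv[s, a])
        Q[s, a] = min(Q[s, a], r + gamma * (ref_sum[s, a] / N[s, a] + adv[s, a]))
        # Slow-yet-adaptive switching: re-derive the executed action at s only
        # when its visit count hits a power of two, keeping switches rare.
        if (N[s, a] & (N[s, a] - 1)) == 0:
            policy[s] = int(np.argmax(Q[s]))
            V_ref[s] = Q[s].max()             # let the reference settle slowly
        s = env.reset() if done else s_next
    return Q, policy
```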
Related papers
- Near-Optimal Algorithm for Non-Stationary Kernelized Bandits [6.379833644595456]
We study a non-stationary kernelized bandit (KB) problem, also called time-varying Bayesian optimization.
We show the first algorithm-independent regret lower bound for non-stationary KB with squared exponential and Matérn kernels.
We propose a novel near-optimal algorithm called restarting phased elimination with random permutation.
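A finite-arm caricature of the named procedure might look as follows; the `pull(arm, t)` reward oracle, the fixed `restart_len`, and the confidence radius are illustrative assumptions, with finitely many arms standing in for the kernelized action space.

```python
import numpy as np

def restarting_phased_elimination(pull, n_arms, horizon, restart_len=2000, rng=None):
    rng = rng or np.random.default_rng(0)
    total, t = 0.0, 0
    while t < horizon:                        # periodic restarts forget stale data
        active = list(range(n_arms))
        means, counts = np.zeros(n_arms), np.zeros(n_arms)
        phase_len, t_restart = 1, t
        while t < horizon and t - t_restart < restart_len and len(active) > 1:
            # Random permutation spreads pulls of surviving arms evenly over
            # the phase, limiting the damage a reward drift can do within it.
            for arm in rng.permutation(active):
                for _ in range(phase_len):
                    if t >= horizon:
                        return total
                    r = pull(int(arm), t)
                    counts[arm] += 1
                    means[arm] += (r - means[arm]) / counts[arm]
                    total += r
                    t += 1
            # Eliminate arms whose upper confidence bound falls below the
            # best lower confidence bound among surviving arms.
            rad = np.sqrt(2.0 * np.log(max(horizon, 2)) / np.maximum(counts, 1.0))
            best_lcb = max(means[a] - rad[a] for a in active)
            active = [a for a in active if means[a] + rad[a] >= best_lcb]
            phase_len *= 2                    # phases double in length
        while t < horizon and t - t_restart < restart_len:
            total += pull(int(active[0]), t)  # commit to the survivor until restart
            t += 1
    return total
```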
arXiv Detail & Related papers (2024-10-21T14:28:26Z)
- Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes [12.76843681997386]
Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice.
This paper proposes a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model.
Our algorithm achieves this rate-optimal regret with improved dependence on the other parameters of the problem.
arXiv Detail & Related papers (2024-07-03T12:36:24Z)
- Accelerating Cutting-Plane Algorithms via Reinforcement Learning Surrogates [49.84541884653309]
A current standard approach to solving convex discrete optimization problems is the use of cutting-plane algorithms.
Despite the existence of a number of general-purpose cut-generating algorithms, large-scale discrete optimization problems continue to suffer from intractability.
We propose a method for accelerating cutting-plane algorithms via reinforcement learning.
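In outline, the idea is a standard cutting-plane loop in which a learned policy replaces the hand-designed cut-selection heuristic. The sketch below is a generic rendering under that reading; all four callables are hypothetical stand-ins supplied by the caller, not this paper's API.

```python
def cutting_plane_with_rl(solve_lp, gen_cuts, is_integral, policy_score,
                          max_rounds=100):
    """Generic cutting-plane loop: solve the relaxation, generate candidate
    cuts, and let a learned scoring policy (the RL surrogate) pick which cut
    to add, in place of rules such as "maximum violation"."""
    cuts = []
    x = solve_lp(cuts)                # optimum of the current relaxation
    for _ in range(max_rounds):
        if is_integral(x):            # relaxation already yields a feasible point
            return x, cuts
        candidates = gen_cuts(x)      # e.g., cuts separating the current x
        if not candidates:
            break
        # The learned surrogate scores each candidate given the current state.
        cuts.append(max(candidates, key=lambda c: policy_score(x, c)))
        x = solve_lp(cuts)            # re-solve with the chosen cut added
    return x, cuts
```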
arXiv Detail & Related papers (2023-07-17T20:11:56Z)
- Accelerated First-Order Optimization under Nonlinear Constraints [73.2273449996098]
We exploit connections between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms.
An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions.
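A toy rendering of the velocity viewpoint, under the stated assumptions of a single smooth constraint, hand-picked `alpha` and `dt`, and plain gradient flow rather than the paper's accelerated dynamics: instead of projecting the iterate onto the feasible set, the constraint g(x) <= 0 is imposed on the velocity via grad g(x) . v <= -alpha g(x).

```python
import numpy as np

def velocity_constrained_descent(grad_f, g, grad_g, x0, alpha=10.0, dt=1e-2,
                                 steps=2000):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        v = -grad_f(x)                            # desired unconstrained velocity
        n, slack = grad_g(x), -alpha * g(x)       # velocity bound from constraint
        if n @ v > slack:                         # bound violated:
            v -= ((n @ v - slack) / (n @ n)) * n  # project v onto the half-space
        x += dt * v                               # explicit Euler step of the flow
    return x

# Toy usage: minimize ||x - (2, 2)||^2 subject to x1 + x2 <= 1.
grad_f = lambda x: 2.0 * (x - np.array([2.0, 2.0]))
g = lambda x: x[0] + x[1] - 1.0
grad_g = lambda x: np.array([1.0, 1.0])
print(velocity_constrained_descent(grad_f, g, grad_g, np.zeros(2)))  # ~ [0.5, 0.5]
```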
arXiv Detail & Related papers (2023-02-01T08:50:48Z)
- Optimal Parameter-free Online Learning with Switching Cost [47.415099037249085]
Parameter-freeness in online learning refers to the adaptivity of an algorithm with respect to the optimal decision in hindsight.
In this paper, we design such algorithms in the presence of a switching cost, which penalizes the optimistic updates required by parameter-freeness.
We propose a simple yet powerful algorithm for Online Linear Optimization (OLO) with switching cost, which improves the existing suboptimal regret bound [ZCP22a] to the optimal rate.
arXiv Detail & Related papers (2022-05-13T18:44:27Z)
- Implicit Parameter-free Online Learning with Truncated Linear Models [51.71216912089413]
Parameter-free algorithms are online learning algorithms that do not require setting learning rates.
We propose new parameter-free algorithms that can take advantage of truncated linear models through a new update that has an "implicit" flavor.
Based on a novel decomposition of the regret, the new update is efficient, requires only one gradient at each step, never overshoots the minimum of the truncated model, and retains the favorable parameter-free properties.
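The no-overshoot property has a simple closed form in the one-dimensional-model case. The sketch below takes a fixed step-size parameter `eta` for illustration; the paper's parameter-free mechanism for setting the scale is not reproduced here.

```python
import numpy as np

def truncated_prox_step(x, loss_val, grad, eta, floor=0.0):
    """One implicit (proximal) step on the truncated linear model
    max(floor, loss_val + grad . (y - x)). The closed-form solution caps the
    step at the kink of the model, so it never steps past the point where
    the truncated model bottoms out at `floor`."""
    gap = loss_val - floor            # height of the model above its floor
    g2 = float(grad @ grad)
    if g2 == 0.0:
        return x
    step = min(eta, gap / g2)         # the cap is what prevents overshooting
    return x - step * grad

# Toy usage on f(y) = |y - 3|: even a huge eta cannot overshoot y = 3.
x = np.array([0.0])
for _ in range(5):
    val, grd = abs(x[0] - 3.0), np.array([np.sign(x[0] - 3.0)])
    x = truncated_prox_step(x, val, grd, eta=100.0)
print(x)  # -> [3.]
```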
arXiv Detail & Related papers (2022-03-19T13:39:49Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
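A toy finite-horizon rendering of the two-sequence design, with an assumed settling threshold and the exploration bonuses omitted for brevity: one vanilla optimistic Q-learning estimate, one reference-advantage estimate, their pointwise minimum as the running value, and a reference that is frozen ("settled") once the value at a state has nearly stopped moving.

```python
import numpy as np

def make_learner(S, A, H, thresh=1.0):
    """Arrays indexed by (step h, state s[, action a]); values at step H are 0."""
    V0 = np.vstack([np.full((H, S), float(H)), np.zeros((1, S))])
    return dict(Qucb=np.full((H, S, A), float(H)), Qref=np.full((H, S, A), float(H)),
                ref_mean=np.zeros((H, S, A)), N=np.zeros((H, S, A), dtype=np.int64),
                V=V0, V_ref=V0.copy(), settled=np.zeros((H, S), dtype=bool),
                H=H, thresh=thresh)

def update(L, h, s, a, r, s_next):
    L["N"][h, s, a] += 1
    n, H = L["N"][h, s, a], L["H"]
    lr = (H + 1.0) / (H + n)
    # Sequence 1: vanilla optimistic Q-learning.
    L["Qucb"][h, s, a] += lr * (r + L["V"][h + 1, s_next] - L["Qucb"][h, s, a])
    # Sequence 2: reference part as a uniform average (its variance shrinks once
    # V_ref stops moving) plus an advantage correction with Q-learning rates.
    L["ref_mean"][h, s, a] += (L["V_ref"][h + 1, s_next] - L["ref_mean"][h, s, a]) / n
    adv = L["V"][h + 1, s_next] - L["V_ref"][h + 1, s_next]
    L["Qref"][h, s, a] += lr * (r + L["ref_mean"][h, s, a] + adv - L["Qref"][h, s, a])
    # The running value is the pointwise minimum of the two sequences.
    new_v = min(L["Qucb"][h, s].max(), L["Qref"][h, s].max(), L["V"][h, s])
    # Early settling: freeze the reference once the value has nearly converged,
    # so that later samples of V_ref(s') are averaged with low variance.
    if not L["settled"][h, s]:
        if n > 1 and L["V"][h, s] - new_v <= L["thresh"]:
            L["settled"][h, s] = True
        else:
            L["V_ref"][h, s] = new_v
    L["V"][h, s] = new_v
```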
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Adaptive Importance Sampling for Finite-Sum Optimization and Sampling with Decreasing Step-Sizes [4.355567556995855]
We propose Avare, a simple and efficient algorithm for adaptive importance sampling for finite-sum optimization and sampling with decreasing step-sizes.
Under standard technical conditions, we show that Avare achieves $\mathcal{O}(T^{2/3})$ and $\mathcal{O}(T^{5/6})$ dynamic regret for SGD and SGLD respectively when run with $\mathcal{O}(T^{5/6})$ step sizes.
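A rough sketch of the sampling side under simplifying assumptions (the uniform-mixing floor, the running norm estimate, and the $1/\sqrt{t}$ step decay are placeholders for Avare's actual choices): per-component sampling probabilities track gradient magnitudes, and the sampled gradient is importance-weighted to keep the update unbiased.

```python
import numpy as np

def adaptive_is_sgd(grads, x0, T, lr0=0.1, rng=None):
    """SGD on f = (1/n) * sum_i f_i with an adaptive sampling distribution.
    `grads[i](x)` returns the gradient of f_i at x."""
    rng = rng or np.random.default_rng(0)
    n = len(grads)
    x = np.asarray(x0, dtype=float)
    norm_est = np.ones(n)                     # running per-component gradient norms
    for t in range(1, T + 1):
        p = norm_est / norm_est.sum()
        p = 0.9 * p + 0.1 / n                 # mix with uniform so every p_i > 0
        i = rng.choice(n, p=p)
        g = grads[i](x)
        norm_est[i] = 0.5 * norm_est[i] + 0.5 * np.linalg.norm(g)
        # Importance weight 1 / (n * p_i) keeps the sampled gradient unbiased
        # for the full gradient; the step size decreases as in the title.
        x -= (lr0 / np.sqrt(t)) * g / (n * p[i])
    return x
```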
arXiv Detail & Related papers (2021-03-23T00:28:15Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
The proposed algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Better Parameter-free Stochastic Optimization with ODE Updates for Coin-Betting [31.60239268539764]
Parameter-free stochastic gradient descent (PFSGD) algorithms do not require setting learning rates while achieving optimal theoretical performance.
In this paper, we close the empirical gap with a new parameter-free algorithm based on continuous-time Coin-Betting on truncated models.
We show empirically that this new parameter-free algorithm outperforms algorithms with the "best default" learning rates and almost matches the performance of finely tuned baselines without anything to tune.
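For context, here is the classic discrete Krichevsky-Trofimov coin-betting update that such parameter-free methods build on: the iterate is a bet of a fraction of accumulated "wealth", so no learning rate is ever set. Gradients are assumed bounded by 1 per coordinate; the paper's actual contribution, an ODE-based update on truncated models, is not reproduced here.

```python
import numpy as np

def kt_coin_betting_sgd(grad, d, T, eps=1.0):
    g_sum = np.zeros(d)            # sum of past negated gradients per coordinate
    wealth = np.full(d, eps)       # initial wealth per coordinate
    x = np.zeros(d)                # current bet (the iterate)
    avg = np.zeros(d)              # averaged iterate, the usual output
    for t in range(1, T + 1):
        g = np.clip(grad(x), -1.0, 1.0)   # coin outcomes must be bounded
        wealth += -g * x                  # settle the previous bet
        g_sum += -g
        x = (g_sum / (t + 1)) * wealth    # KT bet: signed fraction of wealth
        avg += (x - avg) / t
    return avg

# Toy usage: minimize f(x) = ||x - 1||_1 in three dimensions, nothing to tune.
print(kt_coin_betting_sgd(lambda x: np.sign(x - 1.0), d=3, T=5000))  # ~ [1, 1, 1]
```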
arXiv Detail & Related papers (2020-06-12T23:10:25Z)