Fast Rates for the Regret of Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2102.00479v2
- Date: Wed, 12 Jul 2023 09:33:34 GMT
- Title: Fast Rates for the Regret of Offline Reinforcement Learning
- Authors: Yichun Hu, Nathan Kallus, Masatoshi Uehara
- Abstract summary: We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinitehorizon discounted decision process (MDP)
We show that given any estimate for the optimal quality function $Q*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q*$-estimate's pointwise convergence rate.
- Score: 69.23654172273085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the regret of reinforcement learning from offline data generated by
a fixed behavior policy in an infinite-horizon discounted Markov decision
process (MDP). While existing analyses of common approaches, such as fitted
$Q$-iteration (FQI), suggest a $O(1/\sqrt{n})$ convergence for regret,
empirical behavior exhibits \emph{much} faster convergence. In this paper, we
present a finer regret analysis that exactly characterizes this phenomenon by
providing fast rates for the regret convergence. First, we show that given any
estimate for the optimal quality function $Q^*$, the regret of the policy it
defines converges at a rate given by the exponentiation of the $Q^*$-estimate's
pointwise convergence rate, thus speeding it up. The level of exponentiation
depends on the level of noise in the \emph{decision-making} problem, rather
than the estimation problem. We establish such noise levels for linear and
tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman
residual minimization to establish the correct pointwise convergence
guarantees. As specific cases, our results imply $O(1/n)$ regret rates in
linear cases and $\exp(-\Omega(n))$ regret rates in tabular cases. We extend
our findings to general function approximation by extending our results to
regret guarantees based on $L_p$-convergence rates for estimating $Q^*$ rather
than pointwise rates, where $L_2$ guarantees for nonparametric $Q^*$-estimation
can be ensured under mild conditions.
Related papers
- Tangential Randomization in Linear Bandits (TRAiL): Guaranteed Inference and Regret Bounds [1.03590082373586]
We propose and analyze TRAiL, a regret-optimal forced exploration algorithm for linear bandits.
TraiL ensures a $Omega(sqrtT)$ growth in the inference quality, measured via the minimum eigenvalue of the design (regressor) matrix.
We characterize an $Omega(sqrtT)$ minimax lower bound for any algorithm on the expected regret.
arXiv Detail & Related papers (2024-11-19T01:08:13Z) - Convergence Rate Analysis of LION [54.28350823319057]
LION converges iterations of $cal(sqrtdK-)$ measured by gradient Karush-Kuhn-T (sqrtdK-)$.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z) - On the Linear Convergence of Policy Gradient under Hadamard
Parameterization [4.182089296199263]
We study the convergence of deterministic policy gradient under the Hadamard parameterization.
We show that the error decreases at an $O(frac1k)$ rate for all the iterations.
arXiv Detail & Related papers (2023-05-31T05:51:15Z) - On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function
Estimation in Off-policy Evaluation [1.575865518040625]
We study the off-policy evaluation problem in an infinite-horizon Markov decision process with continuous states and actions.
We recast the $Q$-function estimation into a special form of the nonparametric instrumental variables (NPIV) estimation problem.
arXiv Detail & Related papers (2022-01-17T01:09:38Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $widetildeO(B_star sqrtS A K)$, where $K$ is the number of episodes, $S$ is the number of states, $A$
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of gradient descent (SGD)
We show that SGD avoids saddle points/manifolds with $1$ for strict step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Almost sure convergence rates for Stochastic Gradient Descent and
Stochastic Heavy Ball [17.33867778750777]
We study gradient descent (SGD) and the heavy ball method (SHB) for the general approximation problem.
For SGD, in the convex and smooth setting, we provide the first emphalmost sure convergence emphrates for a weighted average of the iterates.
arXiv Detail & Related papers (2020-06-14T11:12:05Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We show a proof of convergence between the Adam Adagrad and $O(d(N)/st)$ algorithms.
Adam converges with the same convergence $O(d(N)/st)$ when used with the default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs)
We propose a new emphupper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.