Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function
Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
- URL: http://arxiv.org/abs/2306.06836v3
- Date: Thu, 7 Mar 2024 15:29:58 GMT
- Title: Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function
Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
- Authors: Jiayi Huang, Han Zhong, Liwei Wang, Lin F. Yang
- Abstract summary: In this work, we address the challenge of such rewards in reinforcement learning with linear function approximation.
We first design an algorithm, textscHeavy-OFUL, for heavy-tailed linear bandits, achieving an emphinstance-dependent $T$-round regret of $tildeObig.
Our result is achieved via a novel self-normalized concentration inequality that may be of independent interest in handling heavytailed noise in general online regression problems.
- Score: 26.277745106128197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While numerous works have focused on devising efficient algorithms for
reinforcement learning (RL) with uniformly bounded rewards, it remains an open
question whether sample or time-efficient algorithms for RL with large
state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with
only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this
work, we address the challenge of such rewards in RL with linear function
approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for
heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round
regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}}
\sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the
\emph{first} of this kind. Here, $d$ is the feature dimension, and
$\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at
the $t$-th round. We further show the above bound is minimax optimal when
applied to the worst-case instances in stochastic and deterministic linear
bandits. We then extend this algorithm to the RL settings with linear function
approximation. Our algorithm, termed as \textsc{Heavy-LSVI-UCB}, achieves the
\emph{first} computationally efficient \emph{instance-dependent} $K$-episode
regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d
\sqrt{H \mathcal{V}^* K})$. Here, $H$ is length of the episode, and
$\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with
the central moment of reward and value functions, respectively. We also provide
a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d
\sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst
case. Our result is achieved via a novel robust self-normalized concentration
inequality that may be of independent interest in handling heavy-tailed noise
in general online regression problems.
Related papers
- Efficient Continual Finite-Sum Minimization [52.5238287567572]
We propose a key twist into the finite-sum minimization, dubbed as continual finite-sum minimization.
Our approach significantly improves upon the $mathcalO(n/epsilon)$ FOs that $mathrmStochasticGradientDescent$ requires.
We also prove that there is no natural first-order method with $mathcalOleft(n/epsilonalpharight)$ complexity gradient for $alpha 1/4$, establishing that the first-order complexity of our method is nearly tight.
arXiv Detail & Related papers (2024-06-07T08:26:31Z) - Sample-Efficient Linear Regression with Self-Selection Bias [7.605563562103568]
We consider the problem of linear regression with self-selection bias in the unknown-index setting.
We provide a novel and near optimally sample-efficient (in terms of $k$) algorithm to recover $mathbfw_1,ldots,mathbfw_kin.
Our algorithm succeeds under significantly relaxed noise assumptions, and therefore also succeeds in the related setting of max-linear regression.
arXiv Detail & Related papers (2024-02-22T02:20:24Z) - Randomized and Deterministic Attention Sparsification Algorithms for
Over-parameterized Feature Dimension [18.57735939471469]
We consider the sparsification of the attention problem.
For any super large feature dimension, we can reduce it down to the size nearly linear in length of sentence.
arXiv Detail & Related papers (2023-04-10T05:52:38Z) - Eluder-based Regret for Stochastic Contextual MDPs [43.19667415823089]
We present the E-UC$3$RL algorithm for regret minimization in Contextual Markov Decision Processes (CMDPs)
Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $ widetildeO(H3 sqrtT |S| |A|d_mathrmE(mathcalP)$.
arXiv Detail & Related papers (2022-11-27T20:38:47Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $mathbfxmapstosigma(mathbfwcdotmathbfx)$ for monotone activations.
The goal of the learner is to output a hypothesis vector $mathbfw$ that $F(mathbbw)=C, epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
We study the problem of PAC learning halfspaces on $mathbbRd$ with Massart noise under Gaussian marginals.
Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
arXiv Detail & Related papers (2021-08-19T16:16:48Z) - Faster Rates of Differentially Private Stochastic Convex Optimization [7.93728520583825]
We study the case where the population risk function satisfies the Tysbakov Noise Condition (TNC) with some parameter $theta>1$.
In the second part, we focus on a special case where the population risk function is strongly convex.
arXiv Detail & Related papers (2021-07-31T22:23:39Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $min(sqrtdT, T3/4)$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $textUCRL-VTR+$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z) - On the Complexity of Minimizing Convex Finite Sums Without Using the
Indices of the Individual Functions [62.01594253618911]
We exploit the finite noise structure of finite sums to derive a matching $O(n2)$-upper bound under the global oracle model.
Following a similar approach, we propose a novel adaptation of SVRG which is both emphcompatible with oracles, and achieves complexity bounds of $tildeO(n2+nsqrtL/mu)log (1/epsilon)$ and $O(nsqrtL/epsilon)$, for $mu>0$ and $mu=0$
arXiv Detail & Related papers (2020-02-09T03:39:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.