Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards
- URL: http://arxiv.org/abs/2306.11455v1
- Date: Tue, 20 Jun 2023 11:12:21 GMT
- Title: Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards
- Authors: Semih Cayci and Atilla Eryilmaz
- Abstract summary: We establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism can be provably robustified against heavy-tailed reward distributions.
We show that a robust variant of NAC based on Robust TD learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample complexity.
- Score: 27.209606183563853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a broad class of reinforcement learning applications, stochastic rewards
have heavy-tailed distributions, which lead to infinite second-order moments
for stochastic (semi)gradients in policy evaluation and direct policy
optimization. In such instances, the existing RL methods may fail miserably due
to frequent statistical outliers. In this work, we establish that temporal
difference (TD) learning with a dynamic gradient clipping mechanism, and
correspondingly operated natural actor-critic (NAC), can be provably
robustified against heavy-tailed reward distributions. It is shown in the
framework of linear function approximation that a favorable tradeoff between
bias and variability of the stochastic gradients can be achieved with this
dynamic gradient clipping mechanism. In particular, we prove that robust
versions of TD learning achieve sample complexities of order
$\mathcal{O}(\varepsilon^{-\frac{1}{p}})$ and
$\mathcal{O}(\varepsilon^{-1-\frac{1}{p}})$ with and without the full-rank
assumption on the feature matrix, respectively, under heavy-tailed rewards with
finite moments of order $(1+p)$ for some $p\in(0,1]$, both in expectation and
with high probability. We show that a robust variant of NAC based on Robust TD
learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample
complexity. We corroborate our theoretical results with numerical experiments.
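As a rough illustration of the mechanism described in the abstract, the sketch below applies TD(0) with linear function approximation and an iteration-dependent clipping threshold. The environment/policy interface, the threshold schedule $b_t = c_0 (t+1)^{1/(1+p)}$, and all hyperparameters are illustrative assumptions, not the authors' exact algorithm or tuning.

```python
import numpy as np

def robust_td_linear(env, policy, phi, d, num_steps,
                     p=0.5, c0=1.0, lr=0.01, gamma=0.99):
    """TD(0) with linear function approximation and a growing clipping
    threshold, in the spirit of the dynamic gradient clipping described
    above. The schedule b_t = c0 * (t + 1) ** (1 / (1 + p)) and all
    hyperparameters are illustrative, not the authors' choices.
    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done)."""
    theta = np.zeros(d)                 # weights of V(s) ~= phi(s) @ theta
    s = env.reset()
    for t in range(num_steps):
        a = policy(s)                   # evaluate a fixed policy
        s_next, r, done = env.step(a)
        # TD semi-gradient; r may be heavy-tailed with only (1+p) finite moments
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        g = -td_error * phi(s)
        # dynamic clipping: the threshold grows with t, so the clipping bias
        # shrinks while no single heavy-tailed sample dominates an update
        b_t = c0 * (t + 1) ** (1.0 / (1.0 + p))
        g_norm = np.linalg.norm(g)
        if g_norm > b_t:
            g = g * (b_t / g_norm)
        theta -= lr * g
        s = env.reset() if done else s_next
    return theta
```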
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm [6.481009996429766]
Inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal.
This work proposes a model-free algorithm to solve the entropy-regularized IRL problem.
arXiv Detail & Related papers (2024-03-25T14:54:42Z)
- Statistical Efficiency of Distributional Temporal Difference Learning [24.03281329962804]
We analyze the finite-sample performance of distributional temporal difference learning (CTD) and quantile temporal difference learning (QTD).
For a $\gamma$-discounted infinite-horizon Markov decision process, we show that for NTD we need $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p}}\right)$ to achieve an $\varepsilon$-optimal estimator with high probability.
We establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest.
arXiv Detail & Related papers (2024-03-09T06:19:53Z)
- Demonstration-Regularized RL [39.96273388393764]
Using expert demonstrations, we identify an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes.
We establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-10-26T10:54:47Z)
- A Finite Sample Complexity Bound for Distributionally Robust Q-learning [17.96094201655567]
We consider a reinforcement learning setting in which the deployment environment is different from the training environment.
Applying a robust Markov decision processes formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al.
This is the first sample complexity result for the model-free robust RL problem.
arXiv Detail & Related papers (2023-02-26T01:15:32Z)
- Stochastic Second-Order Methods Provably Beat SGD For Gradient-Dominated Functions [42.57892322514602]
We show that SCRN improves the best-known sample complexity of stochastic gradient descent by a factor of $\mathcal{O}(\epsilon^{-1/2})$.
We also show that the sample complexity of SCRN can be improved by a factor of $\mathcal{O}(\epsilon^{-1/2})$ using a variance reduction method with time-varying batch sizes.
arXiv Detail & Related papers (2022-05-25T15:33:00Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses; a minimal sketch of this clipping-plus-momentum template appears after this list.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Tightening the Dependence on Horizon in the Sample Complexity of
Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0<\varepsilon<1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
- Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
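As referenced in the heavy-tailed stochastic optimization entry above, the following is a minimal sketch of a generic clipping-plus-momentum-plus-normalization update. The function `grad_fn`, the clipping level, and the step sizes are illustrative assumptions rather than the parameter choices analyzed in that paper.

```python
import numpy as np

def clipped_momentum_descent(grad_fn, x0, steps=1000, lr=0.01, beta=0.9, clip=1.0):
    """Generic sketch: clip each stochastic gradient, average it with
    momentum, then take a normalized descent step. Hyperparameters are
    illustrative; `grad_fn(x)` is assumed to return a stochastic gradient."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        # clipping: a single heavy-tailed sample cannot dominate the update
        g_norm = np.linalg.norm(g)
        if g_norm > clip:
            g = g * (clip / g_norm)
        m = beta * m + (1.0 - beta) * g          # momentum averaging
        # normalized descent: fixed step length along the momentum direction
        x -= lr * m / (np.linalg.norm(m) + 1e-12)
    return x
```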