Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards
- URL: http://arxiv.org/abs/2306.11455v1
- Date: Tue, 20 Jun 2023 11:12:21 GMT
- Title: Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards
- Authors: Semih Cayci and Atilla Eryilmaz
- Abstract summary: We establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism can be provably robustified against heavy-tailed reward distributions.
We show that a robust variant of NAC based on Robust TD learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample complexity.
- Score: 27.209606183563853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a broad class of reinforcement learning applications, stochastic rewards
have heavy-tailed distributions, which lead to infinite second-order moments
for stochastic (semi)gradients in policy evaluation and direct policy
optimization. In such instances, the existing RL methods may fail miserably due
to frequent statistical outliers. In this work, we establish that temporal
difference (TD) learning with a dynamic gradient clipping mechanism, and
correspondingly operated natural actor-critic (NAC), can be provably
robustified against heavy-tailed reward distributions. It is shown in the
framework of linear function approximation that a favorable tradeoff between
bias and variability of the stochastic gradients can be achieved with this
dynamic gradient clipping mechanism. In particular, we prove that robust
versions of TD learning achieve sample complexities of order
$\mathcal{O}(\varepsilon^{-\frac{1}{p}})$ and
$\mathcal{O}(\varepsilon^{-1-\frac{1}{p}})$ with and without the full-rank
assumption on the feature matrix, respectively, under heavy-tailed rewards with
finite moments of order $(1+p)$ for some $p\in(0,1]$, both in expectation and
with high probability. We show that a robust variant of NAC based on Robust TD
learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample
complexity. We corroborate our theoretical results with numerical experiments.
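As a rough illustration of the mechanism described in the abstract, the sketch below applies TD(0) with linear function approximation and an iteration-dependent clipping threshold. The environment/policy interface, the threshold schedule $b_t = c_0 (t+1)^{1/(1+p)}$, and all hyperparameters are illustrative assumptions, not the authors' exact algorithm or tuning.

```python
import numpy as np

def robust_td_linear(env, policy, phi, d, num_steps,
                     p=0.5, c0=1.0, lr=0.01, gamma=0.99):
    """TD(0) with linear function approximation and a growing clipping
    threshold, in the spirit of the dynamic gradient clipping described
    above. The schedule b_t = c0 * (t + 1) ** (1 / (1 + p)) and all
    hyperparameters are illustrative, not the authors' choices.
    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done)."""
    theta = np.zeros(d)                 # weights of V(s) ~= phi(s) @ theta
    s = env.reset()
    for t in range(num_steps):
        a = policy(s)                   # evaluate a fixed policy
        s_next, r, done = env.step(a)
        # TD semi-gradient; r may be heavy-tailed with only (1+p) finite moments
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        g = -td_error * phi(s)
        # dynamic clipping: the threshold grows with t, so the clipping bias
        # shrinks while no single heavy-tailed sample dominates an update
        b_t = c0 * (t + 1) ** (1.0 / (1.0 + p))
        g_norm = np.linalg.norm(g)
        if g_norm > b_t:
            g = g * (b_t / g_norm)
        theta -= lr * g
        s = env.reset() if done else s_next
    return theta
```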
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm [6.481009996429766]
Inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal.
This work proposes a model-free algorithm to solve the entropy-regularized IRL problem.
arXiv Detail & Related papers (2024-03-25T14:54:42Z)
- Statistical Efficiency of Distributional Temporal Difference Learning [24.03281329962804]
We analyze the finite-sample performance of distributional temporal difference learning (CTD) and quantile temporal difference learning (QTD).
For a $\gamma$-discounted infinite-horizon Markov decision process, we show that for NTD we need $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p}}\right)$ to achieve an $\varepsilon$-optimal estimator with high probability.
We establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest.
arXiv Detail & Related papers (2024-03-09T06:19:53Z)
- Demonstration-Regularized RL [39.96273388393764]
Using expert demonstrations, we identify an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes.
We establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-10-26T10:54:47Z)
- A Finite Sample Complexity Bound for Distributionally Robust Q-learning [17.96094201655567]
We consider a reinforcement learning setting in which the deployment environment is different from the training environment.
Applying a robust Markov decision processes formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al.
This is the first sample complexity result for the model-free robust RL problem.
arXiv Detail & Related papers (2023-02-26T01:15:32Z)
- Stochastic Second-Order Methods Provably Beat SGD For Gradient-Dominated Functions [42.57892322514602]
We show that SCRN improves the best-known sample complexity of stochastic gradient descent by a factor of $\mathcal{O}(\epsilon^{-1/2})$.
We also show that the sample complexity of SCRN can be improved by a factor of $\mathcal{O}(\epsilon^{-1/2})$ using a variance reduction method with time-varying batch sizes.
arXiv Detail & Related papers (2022-05-25T15:33:00Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses; a minimal sketch of this clipping-plus-momentum template appears after this list.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Tightening the Dependence on Horizon in the Sample Complexity of
Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0<\varepsilon<1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
- Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
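As referenced in the heavy-tailed stochastic optimization entry above, the following is a minimal sketch of a generic clipping-plus-momentum-plus-normalization update. The function `grad_fn`, the clipping level, and the step sizes are illustrative assumptions rather than the parameter choices analyzed in that paper.

```python
import numpy as np

def clipped_momentum_descent(grad_fn, x0, steps=1000, lr=0.01, beta=0.9, clip=1.0):
    """Generic sketch: clip each stochastic gradient, average it with
    momentum, then take a normalized descent step. Hyperparameters are
    illustrative; `grad_fn(x)` is assumed to return a stochastic gradient."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        # clipping: a single heavy-tailed sample cannot dominate the update
        g_norm = np.linalg.norm(g)
        if g_norm > clip:
            g = g * (clip / g_norm)
        m = beta * m + (1.0 - beta) * g          # momentum averaging
        # normalized descent: fixed step length along the momentum direction
        x -= lr * m / (np.linalg.norm(m) + 1e-12)
    return x
```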