Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces
- URL: http://arxiv.org/abs/2403.05811v2
- Date: Thu, 14 Mar 2024 09:24:51 GMT
- Title: Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces
- Authors: Yang Peng, Liangyu Zhang, Zhihua Zhang
- Abstract summary: We propose a non-parametric distributional TD algorithm (NTD) for a $\gamma$-discounted infinite-horizon Markov decision process.
We establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest.
- Score: 24.03281329962804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One of the core tasks in the field of DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. The distributional temporal difference (TD) algorithm has been accordingly proposed, which is an extension of the temporal difference algorithm in the classic RL literature. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference algorithm (CTD) and quantile temporal difference algorithm (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose a non-parametric distributional TD algorithm (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that for NTD we need $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal (up to logarithmic factors) in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.
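To make the objects in the abstract concrete, here is a minimal sketch, under assumed dynamics, of a categorical distributional TD (CTD-style) update on a small tabular Markov reward process, with estimation error measured in the $1$-Wasserstein distance. The transition matrix P, rewards r, support grid z, and step size are illustrative assumptions, not the authors' exact NTD/CTD implementation.

```python
# Minimal CTD-style sketch on a hypothetical tabular MRP (not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
S, K, gamma = 3, 51, 0.9                      # states, support atoms, discount
P = rng.dirichlet(np.ones(S), size=S)         # assumed transition matrix
r = rng.uniform(0.0, 1.0, size=S)             # assumed rewards in [0, 1]
z = np.linspace(0.0, 1.0 / (1.0 - gamma), K)  # fixed support of the return
eta = np.full((S, K), 1.0 / K)                # categorical return estimates

def project(samples, weights):
    """Project a weighted sample distribution onto the fixed grid z."""
    out, dz = np.zeros(K), z[1] - z[0]
    for g, w in zip(samples, weights):
        j = (min(max(g, z[0]), z[-1]) - z[0]) / dz   # fractional grid index
        lo = int(np.floor(j))
        hi = min(lo + 1, K - 1)
        out[lo] += w * (1.0 - (j - lo))       # split mass between neighbors
        out[hi] += w * (j - lo)
    return out

alpha = 0.05
for _ in range(2000):                         # synchronous CTD-style sweeps
    new = np.empty_like(eta)
    for s in range(S):
        s2 = rng.choice(S, p=P[s])            # one sampled transition from s
        new[s] = (1 - alpha) * eta[s] + alpha * project(r[s] + gamma * z, eta[s2])
    eta = new

def w1(p, q):
    """1-Wasserstein distance between two distributions on the common grid z."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * (z[1] - z[0]))
```

The projection step is what distinguishes CTD from an idealized non-parametric update: the bootstrapped target r + gamma * z lives off the grid, so its mass is split between neighboring atoms before the mixture step.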
Related papers
- Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability, when tail iterate averaging is employed (a minimal tail-averaged TD(0) sketch appears after this list).
arXiv Detail & Related papers (2024-06-12T05:49:53Z)
- Semi-Discrete Optimal Transport: Nearly Minimax Estimation With Stochastic Gradient Descent and Adaptive Entropic Regularization [38.67914746910537]
We prove an $\mathcal{O}(t^{-1})$ lower bound rate for the OT map, using the similarity between Laguerre cell estimation and density support estimation.
To nearly achieve the desired fast rate, we design an entropic regularization scheme decreasing with the number of samples.
arXiv Detail & Related papers (2024-05-23T11:46:03Z)
- Robust Estimation under the Wasserstein Distance [28.792608997509376]
We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions.
We show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk.
arXiv Detail & Related papers (2023-02-02T17:20:25Z)
- A gradient estimator via L1-randomization for online zero-order optimization with two point feedback [93.57603470949266]
We present a novel gradient estimator based on two function evaluations and randomization.
We consider two types of assumptions on the noise of the zero-order oracle: canceling noise and adversarial noise.
We provide an anytime and completely data-driven algorithm, which is adaptive to all parameters of the problem (a sketch of the two-point $\ell_1$-randomized estimator appears after this list).
arXiv Detail & Related papers (2022-05-27T11:23:57Z)
- Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z)
- Wasserstein distance estimates for the distributions of numerical approximations to ergodic stochastic differential equations [0.3553493344868413]
We study the Wasserstein distance between the invariant distribution of an ergodic stochastic differential equation and the distribution of its numerical approximation in the strongly log-concave case.
This allows us to study in a unified way a number of different approximations proposed in the literature for the overdamped and underdamped Langevin dynamics.
arXiv Detail & Related papers (2021-04-26T07:50:04Z)
- Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis [27.679514676804057]
We develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting.
Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller convergence error than both the conventional TDC and the variance-reduced TD.
arXiv Detail & Related papers (2020-10-26T01:33:05Z)
- Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov chain (a minimal SGLD update sketch appears after this list).
arXiv Detail & Related papers (2020-10-19T15:23:18Z)
- On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a broad class of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
- Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms, policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)
- Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)
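For the mean-variance TD entry above, here is a minimal sketch of TD(0) with linear function approximation and tail iterate averaging; the Markov reward process, features, and step sizes are illustrative assumptions, not that paper's setting.

```python
# Hypothetical TD(0) with linear function approximation and tail averaging.
import numpy as np

rng = np.random.default_rng(0)
S, d, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=S)         # assumed transition matrix
r = rng.uniform(size=S)                       # assumed rewards
phi = rng.normal(size=(S, d))                 # fixed feature map

theta, iterates, T, s = np.zeros(d), [], 20000, 0
for t in range(T):
    s2 = rng.choice(S, p=P[s])                # sample one transition
    td_err = r[s] + gamma * phi[s2] @ theta - phi[s] @ theta
    theta = theta + 0.1 / np.sqrt(t + 1) * td_err * phi[s]
    iterates.append(theta.copy())
    s = s2
theta_tail = np.mean(iterates[T // 2:], axis=0)  # average only the tail iterates
```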
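For the zero-order optimization entry above, here is a minimal sketch of a two-point gradient estimator with $\ell_1$ randomization. The objective f, dimension, and step sizes are assumptions; the sampler uses the standard fact that a normalized Laplace vector is uniform on the unit $\ell_1$ sphere.

```python
# Hypothetical two-point zero-order gradient estimator with l1 randomization.
import numpy as np

rng = np.random.default_rng(0)
d = 10
f = lambda x: np.sum((x - 1.0) ** 2)          # assumed smooth objective

def l1_sphere_sample(d):
    """Uniform sample on the unit l1 sphere via normalized Laplace draws."""
    g = rng.laplace(size=d)
    return g / np.sum(np.abs(g))

def grad_estimate(x, h=1e-3):
    """Two function evaluations along a random l1-sphere direction."""
    zeta = l1_sphere_sample(d)
    return (d / (2.0 * h)) * (f(x + h * zeta) - f(x - h * zeta)) * np.sign(zeta)

x = np.zeros(d)
for t in range(1, 5001):                      # plain zero-order gradient descent
    x -= (0.5 / np.sqrt(t)) * grad_estimate(x)
```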
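For the SGLD entry above, here is a minimal sketch of the stochastic gradient Langevin dynamics update; the double-well potential and gradient-noise model are illustrative assumptions, not that paper's experimental setup.

```python
# Hypothetical SGLD on a non-log-concave (double-well) target.
import numpy as np

rng = np.random.default_rng(0)
d, eta, beta = 2, 1e-3, 1.0                   # dimension, step size, inverse temp.

def grad_U(x):
    """Gradient of the assumed potential U(x) = (|x|^2 - 1)^2."""
    return 4.0 * x * (np.sum(x * x) - 1.0)

x, samples = rng.normal(size=d), []
for _ in range(50000):
    noisy_grad = grad_U(x) + 0.1 * rng.normal(size=d)   # stochastic gradient
    x = x - eta * noisy_grad + np.sqrt(2.0 * eta / beta) * rng.normal(size=d)
    samples.append(x.copy())                  # approximately targets exp(-beta*U)
```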
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.