Related papers: Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

URL: http://arxiv.org/abs/2407.21260v1
Date: Wed, 31 Jul 2024 00:43:51 GMT
Title: Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation
Authors: Taehyun Cho, Seungyub Han, Kyungjae Lee, Seokhun Ju, Dohyeong Kim, Jungwoo Lee
Abstract summary: We present a regret analysis for distributional reinforcement learning with general value function approximation. Our theoretical results show that approximating the infinite-dimensional return distribution with a finite number of moment functionals is the only method to learn the statistical information unbiasedly.
Score: 8.378137704007038
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Distributional reinforcement learning improves performance by effectively capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In this paper, we present a regret analysis for distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion of Bellman unbiasedness for a tractable and exactly learnable update via statistical functional dynamic programming. Our theoretical results show that approximating the infinite-dimensional return distribution with a finite number of moment functionals is the only method to learn the statistical information unbiasedly, including nonlinear statistical functionals. Second, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, achieving a regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.

Related papers

Statistical Inference and Learning for Shapley Additive Explanations (SHAP) [20.663970002208846]
The SHAP (short for Shapley additive explanation) framework has become an essential tool for attributing importance to variables in predictive tasks. Despite their ubiquity, there do not exist approaches for performing statistical inference on these quantities. We show that, by treating the SHAP curve as a nuisance function that must be estimated from data, one can reliably construct asymptotically normal estimates of the $p$th powers of SHAP.
arXiv Detail & Related papers (2026-02-11T05:01:47Z)
A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning [48.70183357021465]
Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models. We propose a relative-budget theory explaining this variation through a single quantity called the relative budget, $H/\mathbb{E}[T]$. We show that the relative budget determines sample efficiency by controlling reward variance and the likelihood of informative trajectories.
arXiv Detail & Related papers (2026-02-02T01:31:52Z)
Statistical Inference for Explainable Boosting Machines [5.01181440341076]
Explainable boosting machines (EBMs) are popular "glass-box" models with visualizations of each feature's effect. We provide an alternative using recent advances in statistical inference for gradient boosting, deriving methods for statistical inference as well as end-to-end theoretical guarantees.
arXiv Detail & Related papers (2026-01-26T17:51:09Z)
Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits [58.63897489864948]
Reinforcement learning with outcome-based feedback faces a fundamental challenge: how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.
arXiv Detail & Related papers (2025-05-26T17:44:08Z)
Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
arXiv Detail & Related papers (2023-11-03T13:20:11Z)
Online Statistical Inference for Nonlinear Stochastic Approximation with Markovian Data [22.59079286063505]
We study the statistical inference of nonlinear stochastic approximation algorithms utilizing a single trajectory of Markovian data. Our methodology has practical applications in various scenarios, such as Stochastic Gradient Descent (SGD) on autoregressive data and asynchronous Q-Learning.
arXiv Detail & Related papers (2023-02-15T14:31:11Z)
Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
Offline Reinforcement Learning with Differentiable Function Approximation is Provably Efficient [65.08966446962845]
Offline reinforcement learning, which aims at optimizing decision-making strategies with historical data, has been extensively applied in real-life applications. We take a step by considering offline reinforcement learning with differentiable function class approximation (DFA). Most importantly, we show offline differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning algorithm.
arXiv Detail & Related papers (2022-10-03T07:59:42Z)
Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED). PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
Provable General Function Class Representation Learning in Multitask Bandits and MDPs [58.624124220900306]
Multitask representation learning is a popular approach in reinforcement learning to boost sample efficiency. In this work, we extend the analysis to general function class representations. We theoretically validate the benefit of multitask representation learning within a general function class for bandits and linear MDPs.
arXiv Detail & Related papers (2022-05-31T11:36:42Z)
A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions [39.17511693008055]
Estimating value functions is a core component of reinforcement learning algorithms. We focus on bootstrapping targets used when estimating value functions. We propose a new backup target, the $\eta$-return mixture.
arXiv Detail & Related papers (2022-01-05T21:54:55Z)
Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback. We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode. We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)
Value Function Approximations via Kernel Embeddings for No-Regret Reinforcement Learning [10.828727066443909]
We propose an online model-based RL algorithm, namely the CME-RL, that learns representations of transition distributions as embeddings in a kernel Hilbert space. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $\tilde{O}(\cdot)$ hides only absolute constant and poly-logarithmic factors.
arXiv Detail & Related papers (2020-11-16T11:40:55Z)
Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation. We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure. Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z)
A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms [67.67377846416106]
We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes. We show that value-based methods such as TD($\lambda$) and $Q$-Learning have update rules which are contractive in the space of distributions of functions.
arXiv Detail & Related papers (2020-03-27T05:13:29Z)
Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)
A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers [3.167685495996986]
This paper establishes a precise high-dimensional asymptotic theory for boosting on separable data. Under a class of statistical models, we provide an exact analysis of the generalization error of boosting. We also explicitly pin down the relation between the boosting test error and the optimal Bayes error.
arXiv Detail & Related papers (2020-02-05T00:24:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.