Value Function Approximations via Kernel Embeddings for No-Regret
Reinforcement Learning
- URL: http://arxiv.org/abs/2011.07881v3
- Date: Tue, 28 Jun 2022 16:01:01 GMT
- Title: Value Function Approximations via Kernel Embeddings for No-Regret
Reinforcement Learning
- Authors: Sayak Ray Chowdhury, Rafael Oliveira
- Abstract summary: We propose an online model-based RL algorithm, namely the CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space.
We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors.
- Score: 10.828727066443909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the regret minimization problem in reinforcement learning (RL) in
the episodic setting. In many real-world RL environments, the state and action
spaces are continuous or very large. Existing approaches establish regret
guarantees by either a low-dimensional representation of the stochastic
transition model or an approximation of the $Q$-functions. However, the
understanding of function approximation schemes for state-value functions
largely remains missing. In this paper, we propose an online model-based RL
algorithm, namely the CME-RL, that learns representations of transition
distributions as embeddings in a reproducing kernel Hilbert space while
carefully balancing the exploitation-exploration tradeoff. We demonstrate the
efficiency of our algorithm by proving a frequentist (worst-case) regret bound
that is of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $\tilde{O}(\cdot)$ hides only
absolute constants and poly-logarithmic factors, $H$ is the episode length, $N$ is
the total number of time steps, and $\gamma_N$ is an information-theoretic quantity
related to the effective dimension of the state-action feature space. Our method
bypasses the need for
estimating transition probabilities and applies to any domain on which kernels
can be defined. It also brings new insights into the general theory of kernel
methods for approximate inference and RL regret minimization.
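To make the embedding idea concrete: a conditional mean embedding of the transition distribution $P(\cdot \mid s, a)$ can be estimated from sampled transitions by kernel ridge regression and then used to take expectations of a value function without estimating transition densities. The snippet below is a minimal illustrative sketch of such an empirical embedding on synthetic data; the RBF kernel, lengthscale, and regularizer `lam` are assumptions chosen for illustration, and it is not the authors' CME-RL implementation (which additionally maintains confidence widths to balance exploration and exploitation).

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

# Synthetic transitions z_i = (s_i, a_i) -> s'_i, used only for illustration.
rng = np.random.default_rng(0)
N = 200
Z = rng.uniform(-1, 1, size=(N, 2))                  # state-action features
S_next = np.sin(Z[:, :1] + Z[:, 1:]) + 0.1 * rng.standard_normal((N, 1))

lam = 1e-2                                           # ridge regularizer (assumed)
K_Z = rbf_kernel(Z, Z)                               # Gram matrix on state-action pairs
K_inv = np.linalg.inv(K_Z + N * lam * np.eye(N))     # (K_Z + N*lam*I)^{-1}

def embedded_expectation(z, f_of_next_states):
    """Estimate E[f(s') | z] with the empirical conditional mean embedding:
    weights beta(z) = (K_Z + N*lam*I)^{-1} k_Z(z), estimate beta(z)^T f(S')."""
    k_z = rbf_kernel(z[None, :], Z)                  # kernel vector, shape (1, N)
    beta = k_z @ K_inv                               # embedding weights over observed s'
    return (beta @ f_of_next_states).item()

# Example: a one-step "backup" of a value function V through the embedded model.
def V(s):
    return np.cos(s).sum(axis=-1)                    # placeholder value function

z_query = np.array([0.3, -0.2])
print("Estimated E[V(s') | z]:", embedded_expectation(z_query, V(S_next)))
```

Because the same weights $\beta(z)$ apply to any integrand, the embedded model supports value backups for arbitrary value functions, which is what lets this kind of approach avoid explicit estimation of transition probabilities.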
Related papers
- Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback [30.23951525723659]
We study the low-rank MDPs with adversarially changed losses in the full-information feedback setting.
We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}\big(K^{\frac{5}{6}}A^{\frac{1}{2}}d\ln(1+M)/(1-\gamma)^2\big)$ regret guarantee.
arXiv Detail & Related papers (2023-11-14T03:12:43Z) - Kernelized Reinforcement Learning with Order Optimal Regret Bounds [11.024396385514864]
$\pi$-KRVI is an optimistic modification of least-squares value iteration, where the action-value function is represented by an RKHS.
We prove the first order-optimal regret guarantees under a general setting.
We show a sublinear regret bound that is order optimal in the case of Matérn kernels.
arXiv Detail & Related papers (2023-06-13T13:01:42Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Perturbational Complexity by Distribution Mismatch: A Systematic
Analysis of Reinforcement Learning in Reproducing Kernel Hilbert Space [0.76146285961466]
We analyze reinforcement learning in a general reproducing kernel Hilbert space (RKHS)
We consider a family of Markov decision processes $\mathcal{M}$ whose reward functions lie in the unit ball of an RKHS.
We show that when the reward functions lie in a high dimensional RKHS, even if the transition probability is known and the action space is finite, it is still possible for RL problems to suffer from the curse of dimensionality.
arXiv Detail & Related papers (2021-11-05T12:46:04Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove regret bounds that depend on a complexity measure of the function class $\mathcal{F}$.
Our regret bounds are independent of the number of states.
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - Reinforcement Learning with General Value Function Approximation:
Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation.
We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$, where $d$ is a complexity measure of the function class.
Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z) - Kernel-Based Reinforcement Learning: A Finite-Time Analysis [53.47210316424326]
We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards.
We empirically validate our approach in continuous MDPs with sparse rewards.
arXiv Detail & Related papers (2020-04-12T12:23:46Z)
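The last entry above (Kernel-UCBVI) combines a non-parametric, kernel-smoothed estimate of the rewards with an optimism bonus driven by a generalized visit count. The sketch below illustrates that combination on synthetic data; the Gaussian weighting, the pseudo-count regularizer `beta`, and the bonus scaling are illustrative assumptions rather than the paper's exact estimator or constants.

```python
import numpy as np

def gaussian_weights(query, data, bandwidth=0.5):
    """Kernel weights of a query state-action pair against observed pairs."""
    sq_dists = ((data - query[None, :]) ** 2).sum(axis=1)
    return np.exp(-0.5 * sq_dists / bandwidth**2)

# Observed (state, action) pairs and rewards (synthetic, for illustration only).
rng = np.random.default_rng(1)
SA = rng.uniform(-1, 1, size=(100, 3))           # concatenated state-action features
R = np.cos(SA.sum(axis=1)) + 0.05 * rng.standard_normal(100)

beta = 1.0                                       # pseudo-count regularizer (assumed)
bonus_scale = 1.0                                # optimism scaling (assumed)

def optimistic_reward(query_sa):
    """Kernel-smoothed reward estimate plus an optimism bonus that shrinks
    as the effective number of nearby samples grows."""
    w = gaussian_weights(query_sa, SA)
    count = beta + w.sum()                       # generalized visit count
    r_hat = (w @ R) / count                      # smoothed reward estimate
    bonus = bonus_scale / np.sqrt(count)         # exploration bonus
    return r_hat + bonus

print(optimistic_reward(np.array([0.1, -0.3, 0.2])))
```

The bonus shrinks as the kernel mass around a query grows, so rarely visited regions of the state-action space remain optimistic and keep attracting exploration.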