Kernel-Based Reinforcement Learning: A Finite-Time Analysis
- URL: http://arxiv.org/abs/2004.05599v3
- Date: Wed, 23 Mar 2022 18:36:06 GMT
- Title: Kernel-Based Reinforcement Learning: A Finite-Time Analysis
- Authors: Omar Darwiche Domingues, Pierre M\'enard, Matteo Pirotta, Emilie
Kaufmann, Michal Valko
- Abstract summary: We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards.
We empirically validate our approach in continuous MDPs with sparse rewards.
- Score: 53.47210316424326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the exploration-exploitation dilemma in finite-horizon
reinforcement learning problems whose state-action space is endowed with a
metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that
leverages the smoothness of the MDP and a non-parametric kernel estimator of
the rewards and transitions to efficiently balance exploration and
exploitation. For problems with $K$ episodes and horizon $H$, we provide a
regret bound of $\widetilde{O}\left( H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$
is the covering dimension of the joint state-action space. This is the first
regret bound for kernel-based RL using smoothing kernels, which requires very
weak assumptions on the MDP and has been previously applied to a wide range of
tasks. We empirically validate our approach in continuous MDPs with sparse
rewards.
Related papers
- Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback [30.23951525723659]
We study the low-rank MDPs with adversarially changed losses in the full-information feedback setting.
We propose a policy optimization-based algorithm POLO, and we prove that it attains the $widetildeO(Kfrac56Afrac12dln (1+M)/ (1-gamma)2)$ regret guarantee.
arXiv Detail & Related papers (2023-11-14T03:12:43Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL)
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Restricted Strong Convexity of Deep Learning Models with Smooth
Activations [31.003601717265006]
We study the problem of optimization of deep learning models with smooth activation functions.
We introduce a new analysis of optimization based on Restricted Strong Convexity (RSC)
Ours is the first result on establishing geometric convergence of GD based on RSC for deep learning models.
arXiv Detail & Related papers (2022-09-29T21:24:26Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Value Function Approximations via Kernel Embeddings for No-Regret
Reinforcement Learning [10.828727066443909]
We propose an online model-based RL algorithm, namely the CME-RL, that learns representations of transition distributions as embeddings in a kernel Hilbert space.
We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $tildeObig(Hgamma_NsqrtNbig)$footnote $tildeO(cdot)$ hides only absolute constant and poly-logarithmic factors.
arXiv Detail & Related papers (2020-11-16T11:40:55Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Decision (MDP)
We propose a novel algorithm that makes use of the feature mapping and obtains a $tilde O(dsqrtT/ (1-gamma)2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $ (1-gamma)-0.5$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs)
We propose a new emphupper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.