Bilinear Exponential Family of MDPs: Frequentist Regret Bound with
Tractable Exploration and Planning
- URL: http://arxiv.org/abs/2210.02087v1
- Date: Wed, 5 Oct 2022 08:26:49 GMT
- Title: Bilinear Exponential Family of MDPs: Frequentist Regret Bound with
Tractable Exploration and Planning
- Authors: Reda Ouhamma (CRIStAL), Debabrota Basu (CRIStAL), Odalric-Ambrym
Maillard (CRIStAL)
- Abstract summary: We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions.
We propose an algorithm, BEF-RLSVI, that uses penalized maximum likelihood estimators to learn the unknown parameters.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of episodic reinforcement learning in continuous
state-action spaces with unknown rewards and transitions. Specifically, we
consider the setting where the rewards and transitions are modeled using
parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI,
that a) uses penalized maximum likelihood estimators to learn the unknown
parameters, b) injects a calibrated Gaussian noise in the parameter of rewards
to ensure exploration, and c) leverages linearity of the exponential family
with respect to an underlying RKHS to perform tractable planning. We further
provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound
of $\tilde{\mathcal{O}}(\sqrt{d^3H^3K})$, where $d$ is the dimension of the
parameters, $H$ is the episode length, and $K$ is the number of episodes. Our
analysis improves the existing bounds for the bilinear exponential family of
MDPs by $\sqrt{H}$ and removes the handcrafted clipping deployed in existing
RLSVI-type algorithms. Our regret bound is order-optimal with respect to $H$
and $K$.
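Since the abstract only describes BEF-RLSVI at a high level, the following is a minimal, illustrative Python sketch of steps (a) and (b): a penalized (here, ridge-style) maximum-likelihood fit of the reward parameter, followed by a calibrated Gaussian perturbation of that parameter before planning. The feature map, the quadratic penalty, and the noise calibration (covariance proportional to the inverse regularized design matrix) are assumptions made for illustration, not the paper's exact construction.

# Illustrative sketch only: a ridge-penalized least-squares fit stands in
# for the penalized MLE, and Gaussian noise is injected into the reward
# parameter. Feature map, penalty, and noise scaling are assumptions.
import numpy as np

def penalized_mle_reward(Phi, r, penalty=1.0):
    """Ridge-penalized least squares as a stand-in for the penalized MLE.

    Phi : (n, d) array of state-action features phi(s, a).
    r   : (n,) array of observed rewards.
    """
    d = Phi.shape[1]
    A = Phi.T @ Phi + penalty * np.eye(d)   # regularized design matrix
    theta_hat = np.linalg.solve(A, Phi.T @ r)
    return theta_hat, A

def perturb_reward_parameter(theta_hat, A, noise_scale, rng):
    """Inject calibrated Gaussian noise into the reward parameter.

    Sampling from N(theta_hat, noise_scale^2 * A^{-1}) puts more
    perturbation in poorly explored directions (small eigenvalues of A).
    """
    cov = noise_scale ** 2 * np.linalg.inv(A)
    return rng.multivariate_normal(theta_hat, cov)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 5))             # toy features
theta_true = np.arange(5.0)
r = Phi @ theta_true + 0.1 * rng.normal(size=200)
theta_hat, A = penalized_mle_reward(Phi, r)
theta_tilde = perturb_reward_parameter(theta_hat, A, noise_scale=0.5, rng=rng)
# `theta_tilde` would then be handed to the planner for this episode.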
Related papers
- Settling Constant Regrets in Linear Markov Decision Processes [57.34287648914407]
We study constant regret guarantees in reinforcement learning (RL).
We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs).
For an MDP characterized by a minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}\,\dots))$
arXiv Detail & Related papers (2024-04-16T17:23:19Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{\mathcal{O}}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel, computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises (a minimal sketch of this perturbation step appears after this list).
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)
- Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs).
We propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration, which uses exponential weights to smoothly forget data that are far in the past.
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the feature dimension.
arXiv Detail & Related papers (2020-10-24T11:02:45Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
- Frequentist Regret Bounds for Randomized Least-Squares Value Iteration [94.47472987987805]
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL).
In this paper, we introduce an optimistically-perturbed variant of the popular randomized least-squares value iteration (RLSVI).
Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde{O}(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps.
arXiv Detail & Related papers (2019-11-01T19:48:57Z)
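The randomized-exploration entry above drives exploration by perturbing the regression targets with i.i.d. scalar noise before each value-iteration regression. Below is a minimal sketch of that idea under simplifying assumptions (a linear value class, a single backward-induction step, and an unspecified noise level `sigma`); it is not the cited paper's full algorithm.

# Minimal sketch of RLSVI-style randomized exploration: perturb the
# Bellman regression targets with i.i.d. Gaussian noise, then solve a
# regularized least-squares problem for the Q-function weights.
# The linear value class and noise level are simplifying assumptions.
import numpy as np

def perturbed_lsvi_step(Phi, targets, sigma, reg, rng):
    """One backward-induction step of a randomized LSVI.

    Phi     : (n, d) features of visited state-action pairs.
    targets : (n,) Bellman targets r + max_a' Q_{h+1}(s', a').
    sigma   : std. dev. of the i.i.d. perturbation noise.
    """
    noisy_targets = targets + sigma * rng.normal(size=targets.shape)
    A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ noisy_targets)  # perturbed Q-weights

rng = np.random.default_rng(1)
Phi = rng.normal(size=(100, 4))                       # toy features
targets = Phi @ np.ones(4) + 0.05 * rng.normal(size=100)
w = perturbed_lsvi_step(Phi, targets, sigma=0.3, reg=1.0, rng=rng)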