Restarted Bayesian Online Change-point Detection for Non-Stationary
Markov Decision Processes
- URL: http://arxiv.org/abs/2304.00232v1
- Date: Sat, 1 Apr 2023 05:26:41 GMT
- Title: Restarted Bayesian Online Change-point Detection for Non-Stationary
Markov Decision Processes
- Authors: Reda Alami, Mohammed Mahfoud, Eric Moulines
- Abstract summary: We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD).
We propose an improved version of the UCRL2 algorithm for MDPs with state transition kernel sampled from a multinomial distribution.
We show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right) + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \: \mathbf{KL}\left({\mathbf{\theta}^{(\ell+1)}}\mid\mid{\mathbf{\theta}^{(\ell)}}\right)}}\right)$.
- Score: 12.229154524476405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of learning in a non-stationary reinforcement
learning (RL) environment, where the setting can be fully described by a
piecewise stationary discrete-time Markov decision process (MDP). We introduce
a variant of the Restarted Bayesian Online Change-Point Detection algorithm
(R-BOCPD) that operates on input streams originating from the more general
multinomial distribution and provides near-optimal theoretical guarantees in
terms of false-alarm rate and detection delay. Based on this, we propose an
improved version of the UCRL2 algorithm for MDPs with state transition kernel
sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We
perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a
favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left (\frac{T}{\delta}
\right) + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \:
\mathbf{KL}\left(
{\mathbf{\theta}^{(\ell+1)}}\mid\mid{\mathbf{\theta}^{(\ell)}}\right)}}\right)$,
where $D$ is the largest MDP diameter from the set of MDPs defining the
piecewise stationary MDP setting, $O$ is the finite number of states (constant
over all changes), $A$ is the finite number of actions (constant over all
changes), $K_T$ is the number of change points up to horizon $T$, and
$\mathbf{\theta}^{(\ell)}$ is the transition kernel during the interval
$[c_\ell, c_{\ell+1})$, which we assume to be multinomially distributed over
the set of states $\mathbb{O}$. Interestingly, the performance bound does not
directly scale with the variation in MDP state transition distributions and
rewards, i.e., it can also model abrupt changes. In practice, R-BOCPD-UCRL2
outperforms the state-of-the-art in a variety of scenarios in synthetic
environments. We provide a detailed experimental setup along with a code
repository (upon publication) that can be used to easily reproduce our
experiments.
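For intuition, here is a minimal sketch of a restarted Bayesian change-point detector on a multinomial (categorical) input stream, in the spirit of the abstract's description. It is a generic BOCPD-style illustration, not the paper's R-BOCPD: the symmetric Dirichlet prior, the log-Bayes-factor threshold, the exhaustive split search, and the helper names (`seq_log_marginal`, `restarted_detector`) are assumptions made here for illustration only.

```python
# Hedged sketch: restarted Bayesian change-point detection on a
# categorical (multinomial) stream. Generic illustration only; the
# prior, threshold, and restart rule are assumptions, not the paper's
# R-BOCPD procedure or its theoretical guarantees.
import numpy as np
from scipy.special import gammaln


def seq_log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of an ordered categorical sequence with the
    given per-category counts under a symmetric Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    k, n = counts.size, counts.sum()
    return (gammaln(k * alpha) - gammaln(n + k * alpha)
            + np.sum(gammaln(counts + alpha)) - k * gammaln(alpha))


def restarted_detector(stream, n_categories, threshold=5.0):
    """Scan a stream of category indices; declare a change point when the
    best two-segment explanation of the current run beats the single-segment
    one by `threshold` nats, then restart (discard pre-change data)."""
    run, change_points = [], []
    for t, obs in enumerate(stream):
        run.append(int(obs))
        whole = seq_log_marginal(np.bincount(run, minlength=n_categories))
        best_split = -np.inf
        for s in range(1, len(run)):
            left = np.bincount(run[:s], minlength=n_categories)
            right = np.bincount(run[s:], minlength=n_categories)
            best_split = max(best_split,
                             seq_log_marginal(left) + seq_log_marginal(right))
        if best_split - whole > threshold:
            change_points.append(t)
            run = []  # restart: later inference uses post-change data only
    return change_points


if __name__ == "__main__":
    # Piecewise-stationary categorical stream with one change at t = 150.
    rng = np.random.default_rng(0)
    stream = np.concatenate([rng.choice(3, 150, p=[0.7, 0.2, 0.1]),
                             rng.choice(3, 150, p=[0.1, 0.2, 0.7])])
    print(restarted_detector(stream, n_categories=3))  # expected: a detection shortly after index 150
```

A restart-based learner such as R-BOCPD-UCRL2 would, roughly, reset its optimistic MDP estimates whenever such a detector fires on the observed transition counts, so that planning relies only on post-change data.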
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2023-11-22T15:34:44Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H + d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z)