Fugu-MT Paper Translation (Abstract): Restarted Bayesian Online Change-point Detection for Non-Stationary Markov Decision Processes

Paper summary: Restarted Bayesian Online Change-point Detection for Non-Stationary Markov Decision Processes

arxiv url: http://arxiv.org/abs/2304.00232v1
Date: Sat, 1 Apr 2023 05:26:41 GMT
Status: Translation complete
Last updated in system: 2023-04-04 19:02:08.377984
Title: Restarted Bayesian Online Change-point Detection for Non-Stationary Markov Decision Processes
Title (reference translation): Restarted Bayesian Online Change-Point Detection for Non-Stationary Markov Decision Processes
Authors: Reda Alami, Mohammed Mahfoud, Eric Moulines
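Abstract summary: We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD). We propose an improved version of the UCRL2 algorithm for MDPs whose state transition kernel is sampled from a multinomial distribution. We show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right) + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \: \mathbf{KL}\left( {\mathbf{\theta}^{(\ell+1)}}\mid\mid{\mathbf{\theta}^{(\ell)}}\right)}}\right)$.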
Reference score (independently computed attention score): 12.229154524476405
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider the problem of learning in a non-stationary reinforcement learning (RL) environment, where the setting can be fully described by a piecewise stationary discrete-time Markov decision process (MDP). We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD) that operates on input streams originating from the more general multinomial distribution and provides near-optimal theoretical guarantees in terms of false-alarm rate and detection delay. Based on this, we propose an improved version of the UCRL2 algorithm for MDPs with state transition kernel sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right) + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \: \mathbf{KL}\left( {\mathbf{\theta}^{(\ell+1)}}\mid\mid{\mathbf{\theta}^{(\ell)}}\right)}}\right)$, where $D$ is the largest MDP diameter from the set of MDPs defining the piecewise stationary MDP setting, $O$ is the finite number of states (constant over all changes), $A$ is the finite number of actions (constant over all changes), $K_T$ is the number of change points up to horizon $T$, and $\mathbf{\theta}^{(\ell)}$ is the transition kernel during the interval $[c_\ell, c_{\ell+1})$, which we assume to be multinomially distributed over the set of states $\mathbb{O}$. Interestingly, the performance bound does not directly scale with the variation in MDP state transition distributions and rewards, i.e., it can also model abrupt changes. In practice, R-BOCPD-UCRL2 outperforms the state of the art in a variety of scenarios in synthetic environments. We provide a detailed experimental setup along with a code repository (upon publication) that can be used to easily reproduce our experiments.
Abstract (reference translation): We consider the problem of learning in a non-stationary reinforcement learning (RL) environment. We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD) that operates on input streams drawn from the more general multinomial distribution and provides near-optimal theoretical guarantees in terms of false-alarm rate and detection delay. Building on this, we propose an improved version of the UCRL2 algorithm for MDPs with state transition kernel sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right) + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \: \mathbf{KL}\left( {\mathbf{\theta}^{(\ell+1)}}\mid\mid{\mathbf{\theta}^{(\ell)}}\right)}}\right)$, where $D$ is the largest MDP diameter from the set of MDPs defining the piecewise stationary MDP setting, $O$ is the finite number of states (constant over all changes), $A$ is the finite number of actions (constant over all changes), $K_T$ is the number of change points up to horizon $T$, and $\mathbf{\theta}^{(\ell)}$ is the transition kernel during the interval $[c_\ell, c_{\ell+1})$, which we assume to be multinomially distributed over the set of states $\mathbb{O}$. Interestingly, the performance bound does not scale directly with the variation in the MDP state transition distributions and rewards; abrupt changes can also be modeled. In practice, R-BOCPD-UCRL2 outperforms the state of the art in a variety of scenarios in synthetic environments. We provide a detailed experimental setup along with a code repository (upon publication) that can be used to reproduce our experiments.
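
To make the shape of the stated regret bound concrete, the following is a minimal Python sketch that evaluates the expression inside $O(\cdot)$ with all hidden constants ignored. It is not the authors' code. The function names `kl_categorical` and `regret_bound_term` are hypothetical, and the sketch assumes, as a simplification, that each stationary segment $\ell$ is summarized by a single categorical probability vector standing in for $\mathbf{\theta}^{(\ell)}$ (the abstract describes a full state transition kernel).

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions over the same states.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regret_bound_term(D, O, A, T, K_T, delta, kernels):
    """Evaluate D * O * sqrt(A*T*K_T*log(T/delta) + K_T*log(K_T/delta) / min_KL).

    `kernels` is a hypothetical layout: one probability vector per stationary
    segment, in chronological order. Constants hidden by O(.) are ignored.
    """
    # Smallest divergence between consecutive segments: KL(theta^(l+1) || theta^(l)).
    min_kl = min(
        kl_categorical(kernels[l + 1], kernels[l])
        for l in range(len(kernels) - 1)
    )
    learning_term = A * T * K_T * math.log(T / delta)
    detection_term = K_T * math.log(K_T / delta) / min_kl
    return D * O * math.sqrt(learning_term + detection_term)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    segments = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
    print(regret_bound_term(D=3, O=3, A=2, T=10_000, K_T=1, delta=0.05,
                            kernels=segments))
```

Under these assumptions, the second term grows as consecutive segments become harder to tell apart (small KL divergence), which matches the abstract's remark that the bound depends on the divergence between successive kernels rather than on a total variation budget.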
