Fugu-MT 論文翻訳(概要): Learning Near Optimal Policies with Low Inherent Bellman Error

論文の概要: Learning Near Optimal Policies with Low Inherent Bellman Error

arxiv url: http://arxiv.org/abs/2003.00153v3
Date: Sun, 28 Jun 2020 23:46:53 GMT
ステータス: 翻訳完了
システム内更新日: 2022-12-27 20:08:08.637767
Title: Learning Near Optimal Policies with Low Inherent Bellman Error
Title（参考訳）: 低固有ベルマン誤差による最適政策に近い学習
Authors: Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill
Abstract要約: エピソード強化学習における近似線形作用値関数を用いた探索問題について検討する。我々は,検討した設定に対して最適な統計率を達成するアルゴリズムを用いて,Emphbatch仮定のみを用いて探索を行うことが可能であることを示す。
参考スコア（独自算出の注目度）: 115.16037976819331
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t} \IBE K)$ where $H$ is the horizon, $K$ is the number of episodes, $\IBE$ is the value if the inherent Bellman error and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) it shows that exploration is possible using only \emph{batch assumptions} with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting. Finally, the algorithm reduces to the celebrated \textsc{LinUCB} when $H=1$ but with a different choice of the exploration parameter that allows handling misspecified contextual linear bandits. While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement learning is possible.
Abstract（参考訳）: 近似値反復の収束を示すために通常用いられるベルマン誤差の低い条件の下で, エピソディック強化学習における近似線形作用値関数を用いた探索問題について検討する。まず、この条件を他の共通フレームワークに関連付け、前処理の低階(または線形)の MDP の仮定よりも厳密に一般化されていることを示す。次に、$\widetilde o(\sum_{t=1}^h d_t \sqrt{k} + \sum_{t=1}^h \sqrt{d_t} \ibe k)$ ここで$h$は地平線、$k$はエピソード数、$\ibe$は固有のベルマンエラーと$d_t$がタイムステップ$t$の特徴次元である場合の値である。さらに, 一致した下界を示すことにより, 結果が定数やログを超えないことを示す。これは2つの重要な結果をもたらす。 1) 低ランクmdpにおける従来の作業よりも一般的である設定の最適統計量を達成するアルゴリズムを用いて, \emph{batch assumptions} のみを用いて探索が可能であることを示す。 2) 閉性(ベルマン誤差によって測定される)の欠如は、オンライン環境での作業にもかかわらず$\sqrt{d_t}$でのみ増幅される。最後に、このアルゴリズムは、$H=1$のときの有名な \textsc{LinUCB} に還元されるが、不特定コンテキスト線形帯域を扱うことができる探索パラメータの異なる選択を持つ。 MDPの設定には計算的トラクタビリティの問題がまだ残っているが、これは統計的に効率的な強化学習が可能なアクション値関数の線形表現でMDPのクラスを豊かにする。

論文の概要: Learning Near Optimal Policies with Low Inherent Bellman Error

関連論文リスト