Fugu-MT Paper Translation (Abstract): Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Paper summary: Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

arxiv url: http://arxiv.org/abs/2009.13503v2
Date: Wed, 30 Jun 2021 07:06:10 GMT
Status: Translation complete
Last updated in system: 2022-10-13 21:15:27.781050
Title: Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon
Title (reference translation): Is Reinforcement Learning Harder Than Bandits? A Near-optimal Algorithm That Escapes the Curse of Horizon
Authors: Zihan Zhang, Xiangyang Ji, Simon S. Du
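Abstract summary: Episodic reinforcement learning generalizes contextual bandits. The long planning horizon and the unknown state-dependent transitions add little difficulty to sample complexity. MVP achieves an $O\left(\left(\sqrt{SAK} + S^2A\right) \mathrm{poly}\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of contextual bandits up to logarithmic terms.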
Reference score (independently computed attention score): 88.75843804630772
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult because of its long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions pose, at most, little additional difficulty for sample complexity. We consider episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, and total reward bounded by $1$, where the agent plays for $K$ episodes. We propose a new algorithm, \textbf{M}onotonic \textbf{V}alue \textbf{P}ropagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter because it is based on a well-designed monotonic value function. In particular, the \emph{constants} in the bonus must be set carefully to ensure optimism and monotonicity. We show that MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \mathrm{poly}\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of \emph{contextual bandits} up to logarithmic terms. Notably, this result 1) improves \emph{exponentially} on the state-of-the-art polynomial-time algorithms by Dann et al. [2019] and Zanette et al. [2019] in terms of the dependency on $H$, and 2) improves \emph{exponentially} on the running time of Wang et al. [2020] and significantly improves the dependency on $S$, $A$ and $K$ in sample complexity.
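Abstract (reference translation): Episodic reinforcement learning and contextual bandits are widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often regarded as more difficult because of its long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions pose, at most, little additional difficulty for sample complexity. We study episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, and total reward bounded by $1$, where the agent plays for $K$ episodes. We propose a new algorithm, \textbf{M}onotonic \textbf{V}alue \textbf{P}ropagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter because it is based on a well-designed monotonic value function. In particular, the \emph{constants} in the bonus must be set carefully to guarantee optimism and monotonicity. MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \mathrm{poly}\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of \emph{contextual bandits} up to logarithmic terms. Notably, this result 1) improves \emph{exponentially} on the state-of-the-art polynomial-time algorithms by Dann et al. [2019] and Zanette et al. [2019] in terms of the dependency on $H$, and 2) improves \emph{exponentially} on the running time of Wang et al. [2020] and significantly improves the dependency on $S$, $A$ and $K$ in sample complexity.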
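Worked note: the abstract's upper bound has two terms, and it may help to see when the first one dominates. The short derivation below ignores constants and logarithmic factors; the threshold $K \ge S^3A$ is derived here from the stated bound and is not quoted from the paper.

```latex
% When does the \sqrt{SAK} term dominate the S^2 A term?
\[
\sqrt{SAK} \ge S^2A
\;\iff\; SAK \ge S^4A^2
\;\iff\; K \ge S^3A .
\]
% In that regime the sum is at most twice the leading term:
\[
K \ge S^3A \;\implies\; \sqrt{SAK} + S^2A \le 2\sqrt{SAK} .
\]
```

Under this assumption on $K$, the stated regret bound reduces to $O\left(\sqrt{SAK}\,\mathrm{poly}\log\left(SAHK\right)\right)$, which is the sense in which it approaches the $\Omega\left(\sqrt{SAK}\right)$ lower bound up to logarithmic terms.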

Related papers