Fugu-MT 論文翻訳(概要): Provably Efficient Reinforcement Learning via Surprise Bound

論文の概要: Provably Efficient Reinforcement Learning via Surprise Bound

arxiv url: http://arxiv.org/abs/2302.11634v1
Date: Wed, 22 Feb 2023 20:21:25 GMT
ステータス: 翻訳完了
システム内更新日: 2023-02-24 16:56:52.678599
Title: Provably Efficient Reinforcement Learning via Surprise Bound
Title（参考訳）: サプライズバウンドによる効果的な強化学習
Authors: Hanlin Zhu, Ruosong Wang, Jason D. Lee
Abstract要約: 本稿では,一般値関数近似を用いた効率の良い強化学習アルゴリズムを提案する。本アルゴリズムは, 線形設定と疎高次元線形設定の両方に適用した場合に, 合理的な後悔境界を達成できる。
参考スコア（独自算出の注目度）: 66.15308700413814
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Value function approximation is important in modern reinforcement learning (RL) problems especially when the state space is (infinitely) large. Despite the importance and wide applicability of value function approximation, its theoretical understanding is still not as sophisticated as its empirical success, especially in the context of general function approximation. In this paper, we propose a provably efficient RL algorithm (both computationally and statistically) with general value function approximations. We show that if the value functions can be approximated by a function class that satisfies the Bellman-completeness assumption, our algorithm achieves an $\widetilde{O}(\text{poly}(\iota H)\sqrt{T})$ regret bound where $\iota$ is the product of the surprise bound and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes and $T = HK$ is the total number of steps the agent interacts with the environment. Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting. Moreover, our algorithm only needs to solve $O(H\log K)$ empirical risk minimization (ERM) problems, which is far more efficient than previous algorithms that need to solve ERM problems for $\Omega(HK)$ times.
Abstract（参考訳）: 値関数近似は、特に状態空間が(無限に)大きい場合、現代の強化学習(RL)問題において重要である。価値関数近似の重要性と幅広い適用性にもかかわらず、その理論的理解は、特に一般関数近似の文脈において、経験的成功ほど洗練されていない。本稿では,一般値関数近似を用いた確率効率の良いRLアルゴリズムを提案する。値関数がベルマン完全性仮定を満たす関数クラスで近似できるなら、このアルゴリズムは$\widetilde{o}(\text{poly}(\iota h)\sqrt{t})$ regret bound($\iota$がサプライズバウンドとログ被覆数の積、$h$が計画地平線、$k$がエピソード数、$t = hk$がエージェントが環境と相互作用するステップの総数であることを示す。本アルゴリズムは,線形設定とスパース高次元線形設定の両方に適用することで,合理的な後悔の限界を達成する。さらに,本アルゴリズムは,それまでのerm問題を$\omega(hk)$で解く必要のあるアルゴリズムに比べてはるかに効率的であり,経験的リスク最小化 (erm) 問題のみを解く必要がある。

関連論文リスト

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
我々は,$widetildemathrmO(sqrtmathrmsp(h*) S A T)$のミニマックス最適後悔を伴う最初の抽出可能なアルゴリズムを提案する。注目すべきは、我々のアルゴリズムは$mathrmsp(h*)$に関する事前情報を必要としないことである。
論文参考訳（メタデータ） (2024-06-03T11:53:44Z)
Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation [67.8414514524356]
本稿では,MNL関数近似を用いたMDPの新しいクラスについて検討し,状態空間上の確率分布の正当性を保証する。非線型関数の導入は、計算効率と統計効率の両方において大きな課題を提起する。我々は,$mathcalO(1)$$コストで同じ後悔を実現するアルゴリズムを提案する。
論文参考訳（メタデータ） (2024-05-27T11:31:54Z)
A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [66.26739783789387]
我々は、強化学習のための新しいアルゴリズム、MQL-UCBを用いたモノトニックQ-Learningを提案する。 MQL-UCBは、$tildeO(dsqrtHK)$の最小限の後悔を実現する。本研究は,非線形関数近似を用いたサンプル効率およびデプロイメント効率のよいQ-ラーニングの設計に重点を置いている。
論文参考訳（メタデータ） (2023-11-26T08:31:57Z)
Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
我々は,損失関数が約1,300ドル以上のエピソードに対して任意に変化するような,敵対的決定過程(MDP)の学習を検討する。本稿では,同じ設定で$tildemathcal O(K2/3)$に対する後悔を改善する2つのアルゴリズムを提案する。
論文参考訳（メタデータ） (2023-01-30T14:37:21Z)
Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
本研究は,RL(Human-in-the-loop reinforcement learning)を軌道的嗜好で検討する。各ステップで数値的な報酬を受ける代わりに、エージェントは人間の監督者から軌道上のペアよりも優先される。一般関数近似を用いたPbRLの楽観的モデルベースアルゴリズムを提案する。
論文参考訳（メタデータ） (2022-05-23T09:03:24Z)
Improved Regret Bound and Experience Replay in Regularized Policy Iteration [22.621710838468097]
無限ホライゾンマルコフ決定過程(mdps)における学習アルゴリズムを関数近似を用いて検討する。まず、ほぼ同一の仮定の下で、Politexアルゴリズムの後悔解析を$O(T3/4)$から$O(sqrtT)$にシャープできることを示す。その結果、計算効率の良いアルゴリズムに対して、最初の高い確率の$o(sqrtt)$ regretバウンドが得られる。
論文参考訳（メタデータ） (2021-02-25T00:55:07Z)
Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
我々は、敵対的な報酬と完全な情報フィードバックで有限正方体エピソディックマルコフ決定プロセスのための強化学習を研究します。我々は、$tildeO(dHsqrtT)$ regretを達成できることを示し、$H$はエピソードの長さである。また、対数因子までの$tildeOmega(dHsqrtT)$の値が一致することを証明する。
論文参考訳（メタデータ） (2021-02-17T18:54:08Z)
Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
一般値関数近似を用いた効率の良い強化学習アルゴリズムを確立する。我々のアルゴリズムは、$d$が複雑性測度である場合、$widetildeO(mathrmpoly(dH)sqrtT)$の後悔の限界を達成することを示す。我々の理論は線形値関数近似によるRLの最近の進歩を一般化し、環境モデルに対する明示的な仮定をしない。
論文参考訳（メタデータ） (2020-05-21T17:36:09Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。