Fugu-MT 論文翻訳(概要): Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

論文の概要: Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

arxiv url: http://arxiv.org/abs/2110.04645v1
Date: Sat, 9 Oct 2021 21:13:48 GMT
ステータス: 翻訳完了
システム内更新日: 2021-10-12 16:42:43.041986
Title: Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning
Title（参考訳）: 適応型モデルフリー強化学習におけるサンプル複雑性障壁の破却
Authors: Gen Li, Laixi Shi, Yuxin Chen, Yuantao Gu, Yuejie Chi
Abstract要約: 漸進的強化学習における後悔を最小限に抑えるために,新しいモデルフリーアルゴリズムを提案する。提案アルゴリズムは、2つのQ-ラーニングシーケンスの助けを借りて、初期設定された参照更新ルールを用いる。初期の分散還元法の設計原理は、他のRL設定とは独立した関心を持つかもしれない。
参考スコア（独自算出の注目度）: 52.76230802067506
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved towards characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., $S^6A^4 \,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves -- by at least a factor of $S^5A^3$ -- upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called {\em reference-advantage decomposition}), the proposed algorithm employs an {\em early-settled} reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.
Abstract（参考訳）: オンラインエピソジック強化学習(rl)におけるサンプル効率の達成には、探索と搾取の最適バランスが必要である。 s$ 状態、$a$ アクション、地平線の長さ $h$ を持つ有限ホリゾンのエピソディックマルコフ決定プロセスに関しては、サンプル総数 $t$ で$\sqrt{h^2sat}$ (modulo log factor) の順にスケールするminimax-optimal regret を特徴付けるためにかなりの進歩があった。いくつかの競合する解パラダイムは後悔を最小限に抑えるために提案されているが、それらはメモリ非効率であるか、サンプルサイズが巨大なしきい値(例えば、既存のモデルフリーメソッドでは$S^6A^4 \,\mathrm{poly}(H)$)を超えない限り、最適性に欠ける。このような大きなサンプルサイズ障壁を克服して効率的なRLを実現するために、サンプルサイズが$SA\,\mathrm{poly}(H)$を超えると、ほぼ最適に後悔する空間複雑性$O(SAH)$で新しいモデルフリーアルゴリズムを設計する。このサンプルサイズ要件(初期バーンインコストとも呼ばれる)に関して、我々の手法は、漸近的に後悔と最適化される以前のメモリ効率アルゴリズムに対して、少なくとも$S^5A^3$の係数で改善する。提案アルゴリズムは,最近導入された分散低減戦略(「vs-reference-advantage decomposition」とも呼ばれる)を活用し,上位と下位の信頼境界を持つ2つのQ-ラーニングシーケンスの助けを借りて,初期セットされた参照更新ルールを採用する。初期の分散還元法の設計原理は、複雑な探査・探査のトレードオフを含む他のRL設定と独立して関心を持つかもしれない。

関連論文リスト

Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL [6.996002801232415]
生成モデルを用いてマルコフ決定過程(MDP)において,$varepsilon$-optimal Policyを求める際のサンプル複雑性について検討した。我々は,知識を必要とせず,最適なスパンベース複雑性に適合するアルゴリズムを開発した。
論文参考訳（メタデータ） (2025-02-16T19:10:55Z)
Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
モデルフリーのステージベースQ-ラーニングアルゴリズムはモデルベースアルゴリズムと同じ$H$依存の最適性を享受できることを示す。本アルゴリズムは,楽観的値関数と悲観的値関数のペアとして参照値関数を更新するキーとなる新しい設計を特徴とする。
論文参考訳（メタデータ） (2023-08-17T08:34:58Z)
Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
バーンインコストを発生させることなく、最小限の最適後悔を実現する方法を示す。最適値/コストや一定の分散といった問題依存量の影響を明らかにするために、我々の理論を拡張します。
論文参考訳（メタデータ） (2023-07-25T15:42:11Z)
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds [24.571530193140916]
エントロピーリスク尺度(EntRM)が目的である有限エピソードマルコフ決定過程を考察する。モデルフリーとモデルベースを含む2つの異なるスキームを用いて最適化を実装する2つの新しいDRLアルゴリズムを提案する。いずれも$tildemathcalO(fracexp(|beta|H)-1|beta|HsqrtS2AK)$ regret upper bound, where $S$, $A$, $K$, $H$は数値を表す。
論文参考訳（メタデータ） (2022-10-25T14:30:48Z)
Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
オフライン強化学習(RL)は、事前収集されたデータを用いて、さらなる探索を行わずに学習する。事前のアルゴリズムや分析は、最適なサンプルの複雑さに悩まされるか、サンプルの最適性に到達するために高いバーンインコストがかかるかのいずれかである。モデルベース(あるいは"プラグイン")アプローチは,バーンインコストを伴わずに,最小限のサンプル複雑性を実現することを実証する。
論文参考訳（メタデータ） (2022-04-11T17:26:19Z)
Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement Learning? [30.065091907118827]
本研究は,マルコフ決定過程(MDP)における$epsilon$-optimal Policyの発見の複雑さについて考察する。実験モデルを構築し,任意のプラグインソルバを用いて実験モデルを計画するプラグインソルバ手法を用いてこの問題を解決する。プラグインアプローチはサンプル効率も向上し,強化学習のためのモデルベースアルゴリズムを設計するための柔軟なアプローチを提供する。
論文参考訳（メタデータ） (2020-10-12T13:13:01Z)
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
モデルに基づくMARLは、Nash平衡値(NE)を求めるために$tilde O(|S||B|(gamma)-3epsilon-2)$のサンプル複雑性を実現する。また、アルゴリズムが報酬に依存しない場合、そのようなサンプル境界は最小値(対数因子まで)であり、アルゴリズムは報酬知識のない遷移サンプルを問合せする。
論文参考訳（メタデータ） (2020-07-15T03:25:24Z)
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
本稿では、生成モデル(シミュレータ)へのアクセスを想定して、強化学習のサンプル効率について検討する。最初に$gamma$-discounted infinite-horizon Markov decision process (MDPs) with state space $mathcalS$ and action space $mathcalA$を考える。対象の精度を考慮すれば,モデルに基づく計画アルゴリズムが最小限のサンプルの複雑さを実現するのに十分であることを示す。
論文参考訳（メタデータ） (2020-05-26T17:53:18Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。