Fugu-MT 論文翻訳(概要): Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

論文の概要: Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

arxiv url: http://arxiv.org/abs/2012.08005v3
Date: Sun, 25 Apr 2021 22:13:37 GMT
ステータス: 翻訳完了
システム内更新日: 2021-05-08 14:42:14.325831
Title: Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL
Title（参考訳）: バッチ強化学習のための指数的下界:バッチRLはオンラインRLよりも指数的に難しい
Authors: Andrea Zanette
Abstract要約: この作業は、新しいoracle + batch algorithm'フレームワークを導入して、すべてのディストリビューションに保持される下限を証明する。技術的なレベルでは、この研究はEmphdeadly triadとして知られる問題を形式化し、Emphbootstrapping問題はRLのEmphextrapolation問題よりも深刻な可能性があると説明する。
参考スコア（独自算出の注目度）: 11.95134088233295
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near optimal policy or to 2) estimate the value of a target policy. For both tasks we derive \emph{exponential} information-theoretic lower bounds in discounted infinite horizon MDPs with a linear function representation for the action value function even if 1) \emph{realizability} holds, 2) the batch algorithm observes the exact reward and transition \emph{functions}, and 3) the batch algorithm is given the \emph{best} a priori data distribution for the problem class. Furthermore, if the dataset does not come from policy rollouts then the lower bounds hold even if the action-value function of \emph{every} policy admits a linear representation. If the objective is to find a near-optimal policy, we discover that these hard instances are easily solved by an \emph{online} algorithm, showing that there exist RL problems where \emph{batch RL is exponentially harder than online RL} even under the most favorable batch data distribution. In other words, online exploration is critical to enable sample efficient RL with function approximation. A second corollary is the exponential separation between finite and infinite horizon batch problems under our assumptions. On a technical level, this work introduces a new `oracle + batch algorithm' framework to prove lower bounds that hold for every distribution, and automatically recovers traditional fixed distribution lower bounds as a special case. Finally this work helps formalize the issue known as \emph{deadly triad} and explains that the \emph{bootstrapping} problem \citep{sutton2018reinforcement} is potentially more severe than the \emph{extrapolation} issue for RL because unlike the latter, bootstrapping cannot be mitigated by adding more samples.
Abstract（参考訳）: 強化学習のいくつかの実践的応用は、エージェントが過去のデータから学習することを含む。多くの場合、これらのアプリケーションでは、1)ほぼ最適なポリシーを特定したり、2)ターゲットポリシーの価値を見積もる必要があります。いずれのタスクに対しても, 1 \emph{realizability} が成立しても,アクション値関数に対する線形関数表現を持つディスカウント無限大地平線mdpにおける情報理論的下限である \emph{exponential} を導出し, 2) バッチアルゴリズムは厳密な報酬と遷移を観測し, 3) バッチアルゴリズムは問題クラスに対する事前データ分布として \emph{best} を与えられる。さらに、もしデータセットがポリシーのロールアウトから来ない場合、下界は、もし \emph{every} ポリシーのアクション値関数が線形表現を許すとしても保持する。目的が最適に近いポリシーを見つけることなら、これらのハードインスタンスは \emph{online} アルゴリズムで容易に解けることを発見し、最も好ましいバッチデータ分布下であっても、オンライン RL よりも指数関数的に困難であるような RL 問題が存在することを示した。言い換えれば、オンライン探索は、関数近似を用いたサンプル効率のよいRLを実現するために重要である。第二の補題は、仮定の下で有限と無限のホライズンバッチ問題の指数的分離である。技術的レベルでは、この研究は、すべての分布に保持される下位境界を証明し、特別なケースとして従来の固定分布の下限を自動的に回収する新しい 'oracle + batch algorithm' フレームワークを導入している。最後に、この研究は \emph{deadly triad} として知られる問題を形式化し、 \emph{bootstrapping} 問題 \citep{sutton2018reinforcement} が RL の \emph{extrapolation} 問題よりも深刻な可能性があると説明する。

関連論文リスト

When is Agnostic Reinforcement Learning Statistically Tractable? [76.1408672715773]
エンフスパンニング容量と呼ばれる新しい複雑性測度は、設定された$Pi$にのみ依存し、MDPダイナミクスとは独立である。我々は、学習するためにスーパーポリノミカルな数のサンプルを必要とする制限付きスパンリング能力を持つポリシークラス$Pi$が存在することを示した。これにより、生成的アクセスとオンラインアクセスモデルの間の学習可能性の驚くほどの分離が明らかになる。
論文参考訳（メタデータ） (2023-10-09T19:40:54Z)
On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation [80.86358123230757]
本稿では,Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI) というアルゴリズムを提案する。部分的なデータカバレッジの仮定の下で、BCP-VI は最適な Q-値関数に正のギャップがあるときに、オフライン RL に対して $tildemathcalO(frac1K)$ の高速レートを得る。これらは、アダプティブデータからの線形関数近似を持つオフラインRLに対してそれぞれ、最初の$tildemathcalO(frac1K)$boundと絶対零部分最適境界である。
論文参考訳（メタデータ） (2022-11-23T18:50:44Z)
Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
我々はPessimistic vAlue iteRaTionとrEward Decomposition (PARTED)という新しいオフライン強化学習アルゴリズムを提案する。 PartEDは、最小2乗ベースの報酬再分配を通じて、ステップごとのプロキシ報酬に軌道を分解し、学習したプロキシ報酬に基づいて悲観的な値を実行する。私たちの知る限りでは、PartEDは、トラジェクティブな報酬を持つ一般のMDPにおいて、証明可能な効率のよい最初のオフラインRLアルゴリズムである。
論文参考訳（メタデータ） (2022-06-13T19:11:22Z)
Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
半可観測マルコフ決定過程におけるオフライン強化学習(RL)について検討する。本稿では,UnderlineProxy変数 underlinePessimistic UnderlinePolicy UnderlineOptimization (textttP3O)アルゴリズムを提案する。 textttP3Oは、確立されたデータセットを持つPOMDPのための証明可能な最初のオフラインRLアルゴリズムである。
論文参考訳（メタデータ） (2022-05-26T19:13:55Z)
Towards Instance-Optimal Offline Reinforcement Learning with Pessimism [34.54294677335518]
我々は、未知マルコフ決定過程(MDP)における報酬最大化ポリシーの学習を目標とするオフライン強化学習(オフラインRL)問題について検討する。本研究では、適応悲観的値反復法(APVI)アルゴリズムを分析し、[Oleft(sum_h=1Hsum_s_h,a_hdpistar_h(s_h,a_h)sqrtfracmathrmmathrmVar_]とほぼ一致する準最適上限を導出する。
論文参考訳（メタデータ） (2021-10-17T01:21:52Z)
Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits [45.43968161616453]
バッチ線形文脈帯域に対する最適バッチ-regretトレードオフについて検討する。時間的地平線が成長するにつれて2相表現を特徴とする後悔の保証を証明します。また, 動的上界に依存した新しい行列不等式濃度を証明した。
論文参考訳（メタデータ） (2021-10-15T12:32:33Z)
Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations [79.66404989555566]
我々は、リッチな観測空間を持つより現実的な非依存的RLの設定と、近似的ポリシーを含まないような固定されたポリシーのクラス$Pi$を考える。我々は,MDPの階数$d$の誤差が有界な設定のためのアルゴリズムを提案する。
論文参考訳（メタデータ） (2021-06-22T03:20:40Z)
Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
本稿では,RLアルゴリズムによって収集されたデータポイントの情報取得量を測定する,効率的なオンラインサブサンプリングフレームワークを確立する。複雑性バウンド関数クラスを持つ値ベースのメソッドの場合、$proptooperatornamepolylog(K)$ timesに対してのみポリシーを更新する必要がある。少なくとも$Omega(K)$倍のポリシーを更新する既存のアプローチとは対照的に、当社のアプローチはポリシーの解決における最適化コールの数を劇的に削減します。
論文参考訳（メタデータ） (2021-06-14T07:36:25Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。