Fugu-MT 論文翻訳(概要): On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data

論文の概要: On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data

arxiv url: http://arxiv.org/abs/2106.09973v1
Date: Fri, 18 Jun 2021 07:54:23 GMT
ステータス: 翻訳完了
システム内更新日: 2021-06-21 14:21:37.584903
Title: On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data
Title（参考訳）: 政策データを用いたバッチ強化学習の複雑さについて
Authors: Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, Csaba Szepesvari
Abstract要約: 有限マルコフ決定過程(MDP)におけるよい政策を学習する際のサンプル複雑性の基本的な問題について検討する。本研究の主目的は,計画的地平線$H$が有限である場合,適切な政策を得るのに必要な最小限の遷移数であるサンプル複雑性が,関連する量の指数関数であることを示す。
参考スコア（独自算出の注目度）: 80.47164498884271
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the fundamental question of the sample complexity of learning a good policy in finite Markov decision processes (MDPs) when the data available for learning is obtained by following a logging policy that must be chosen without knowledge of the underlying MDP. Our main results show that the sample complexity, the minimum number of transitions necessary and sufficient to obtain a good policy, is an exponential function of the relevant quantities when the planning horizon $H$ is finite. In particular, we prove that the sample complexity of obtaining $\epsilon$-optimal policies is at least $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H+1)})$ for $\gamma$-discounted problems, where $\mathrm{S}$ is the number of states, $\mathrm{A}$ is the number of actions, and $H$ is the effective horizon defined as $H=\lfloor \tfrac{\ln(1/\epsilon)}{\ln(1/\gamma)} \rfloor$; and it is at least $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H)}/\varepsilon^2)$ for finite horizon problems, where $H$ is the planning horizon of the problem. This lower bound is essentially matched by an upper bound. For the average-reward setting we show that there is no algorithm finding $\epsilon$-optimal policies with a finite amount of data.
Abstract（参考訳）: 有限マルコフ決定過程(MDP)において、学習に利用可能なデータが、基礎となるMDPを知らずに選択しなければならないロギングポリシーに従うことによって得られる場合、よい政策を学ぶためのサンプルの複雑さに関する根本的な疑問を考察する。本研究の主目的は,計画的地平線$H$が有限である場合,適切な政策を得るために必要な最小限の遷移数であるサンプル複雑性が,関連する量の指数関数であることを示す。 In particular, we prove that the sample complexity of obtaining $\epsilon$-optimal policies is at least $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H+1)})$ for $\gamma$-discounted problems, where $\mathrm{S}$ is the number of states, $\mathrm{A}$ is the number of actions, and $H$ is the effective horizon defined as $H=\lfloor \tfrac{\ln(1/\epsilon)}{\ln(1/\gamma)} \rfloor$; and it is at least $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H)}/\varepsilon^2)$ for finite horizon problems, where $H$ is the planning horizon of the problem. この下界は基本的に上界と一致する。平均帰納的な設定では、有限のデータ量で$\epsilon$-optimalポリシーを見つけるアルゴリズムは存在しない。

関連論文リスト

Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits [58.63897489864948]
結果に基づくフィードバックによる強化学習は、根本的な課題に直面します。適切なアクションにクレジットを割り当てるには? 本稿では,一般関数近似を用いたオンラインRLにおけるこの問題の包括的解析を行う。
論文参考訳（メタデータ） (2025-05-26T17:44:08Z)
Top-$nσ$: Not All Logits Are You Need [25.133593066927794]
ソフトマックス前のロジットを直接操作する新しいサンプリング手法である Top-nsigma$ を導入する。温度スケーリングにかかわらず,トップ$nsigma$は安定したサンプリング空間を維持していることを示す。また、その振る舞いをよりよく理解するために、トップ$nsigma$の理論分析も提供する。
論文参考訳（メタデータ） (2024-11-12T08:46:43Z)
Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
条件分布は機械学習の中心的な問題ですペアデータとペアデータの両方を統合する新しい学習パラダイムを提案する。我々のアプローチはまた、興味深いことに逆エントロピー最適輸送(OT)と結びついている。
論文参考訳（メタデータ） (2024-10-03T16:12:59Z)
Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge [0.704590071265998]
オンラインQ-ラーニング手法のサンプル複雑性について,動的知識が利用可能であったり,効率的に学習できたりした場合に検討する。我々は,$f$の完全知識の下で,$tildemathcalO(textPoly(H)sqrtSAT)$ regretを達成する楽観的なQ-ラーニングアルゴリズムを提案する。
論文参考訳（メタデータ） (2023-12-19T19:53:58Z)
Multi-Task Imitation Learning for Linear Dynamical Systems [50.124394757116605]
線形システム上での効率的な模倣学習のための表現学習について検討する。学習対象ポリシーによって生成された軌道上の模倣ギャップは、$tildeOleft(frack n_xHN_mathrmshared + frack n_uN_mathrmtargetright)$で制限されている。
論文参考訳（メタデータ） (2022-12-01T00:14:35Z)
Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
我々は、オリジナルのトレーニング例が忘れられた後(もしあれば)のエポックを追跡する補足的メトリックである$$-second-$split$$forgetting$$$time$ (SSFT)を提案する。例えば$mislabeled$の例はすぐに忘れられ、$rare$の例は比較的ゆっくりと忘れられています。 SSFTは、(i)間違ったラベル付きサンプルを識別し、その除去により一般化が向上し、(ii)障害モードに関する洞察を提供する。
論文参考訳（メタデータ） (2022-10-26T21:03:46Z)
Tractable Optimality in Episodic Latent MABs [75.17357040707347]
我々は、エージェントが時間ステップ$H$のエピソードのために環境と対話する、M$遅延コンテキストを持つマルチアームバンディット問題を考える。エピソードの長さによっては、学習者は遅れた文脈を正確に見積もることができないかもしれない。我々は、$O(textttpoly(A) + textttpoly(M,H)min(M,H))$インタラクションを用いて、ほぼ最適なポリシーを確実に学習する手順を設計する。
論文参考訳（メタデータ） (2022-10-05T22:53:46Z)
Revealing Unobservables by Deep Learning: Generative Element Extraction Networks (GEEN) [5.3028918247347585]
本稿では,ランダムサンプル中の潜伏変数$X*$を推定する新しい手法を提案する。我々の知る限りでは、この論文は観測においてそのような識別を初めて提供するものである。
論文参考訳（メタデータ） (2022-10-04T01:09:05Z)
A Few Expert Queries Suffices for Sample-Efficient RL with Resets and Linear Value Approximation [16.29514743112387]
最適値関数のみを線形化可能な設定において、サンプル効率のよい強化学習(RL)について検討する。専門的なクエリと探索をブレンドするための統計的・計算学的に効率的なアルゴリズム(Delphi)を提案する。 Delphi には $tildemathcalO(d)$ エキスパートクエリと $texttpoly(d,|mathcalA|,1/varepsilon)$ 探索サンプルの量が必要です。
論文参考訳（メタデータ） (2022-07-18T01:39:13Z)
Indirect Active Learning [7.84669346764821]
局所的にX$とY$の関係を推定するためのミニマックス収束率について検討する。多くの場合、アクティブな学習には利点があるが、この利点は2つの受動的実験を連続して実行する単純な2段階学習者によって完全に実現されている。
論文参考訳（メタデータ） (2022-06-03T08:37:35Z)
Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
既知の滑らかな関数 $f$ の値を未知の点 $boldsymbolmu in mathbbRn$ で推定する問題について検討する。我々は、各座標の重要性に応じてサンプルを学習するインスタンス適応アルゴリズムを設計し、少なくとも1-delta$の確率で$epsilon$の正確な推定値である$f(boldsymbolmu)$を返す。
論文参考訳（メタデータ） (2022-03-18T18:50:52Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。