Fugu-MT Paper Translation (Abstract): Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

Paper overview: Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

arxiv url: http://arxiv.org/abs/2210.04946v1
Date: Mon, 10 Oct 2022 18:34:32 GMT
Status: Translation complete
Last updated in system: 2022-10-12 13:35:29.718077
Title: Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path
Title (reference translation): Reaching Goals is Hard - Settling the Sample Complexity of the Stochastic Shortest Path
Authors: Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
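Abstract summary: This paper studies the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We derive sample complexity bounds for the case where the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states.
Reference score (independently computed attention score): 106.37656068276902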
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min}=0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this result with lower bounds when prior knowledge of the hitting time of the optimal policy is available and when we restrict optimality by competing against policies with bounded hitting time. Finally, we design an algorithm with matching upper bounds in these cases. This settles the sample complexity of learning $\epsilon$-optimal policies in SSP with generative models. We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general. On the other hand, efficient learning can be made possible if we assume the agent can directly reach the goal state from any state by paying a fixed cost. We then establish the first upper and lower bounds under this assumption. Finally, using similar analytic tools, we prove that horizon-free regret is impossible in SSPs under general costs, resolving an open problem in (Tarbouriech et al., 2021c).
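Abstract (reference translation): We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. First, we derive sample complexity bounds for the case where the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, on which any algorithm needs at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that an SSP problem may not be learnable whenever $c_{\min}=0$, so learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this result with lower bounds for the case where prior knowledge of the hitting time of the optimal policy is available, and for the case where optimality is restricted by competing against policies with bounded hitting time. Finally, we design an algorithm whose upper bounds match in these cases. This settles the sample complexity of learning $\epsilon$-optimal policies in SSP with generative models. We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (the so-called best-policy identification problem) and show that sample-efficient learning is impossible in general. On the other hand, efficient learning becomes possible if we assume the agent can reach the goal state directly from any state by paying a fixed cost. We then establish the first upper and lower bounds under this assumption. Finally, using similar analytic tools, we prove that horizon-free regret is impossible in SSPs under general costs, resolving an open problem in (Tarbouriech et al., 2021c).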
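
To make the scaling of the lower bound in the abstract concrete, here is a minimal Python sketch that evaluates the quantity $SAB_{\star}^3/(c_{\min}\epsilon^2)$. The function name `ssp_lower_bound_scale` is hypothetical and not from the paper, constants and logarithmic factors hidden by $\Omega(\cdot)$ are ignored, and returning infinity at $c_{\min}=0$ is only an illustration of the "may not be learnable" remark, not a reproduction of the paper's argument.

```python
import math


def ssp_lower_bound_scale(S, A, B_star, c_min, eps):
    """Order of magnitude of S * A * B_star^3 / (c_min * eps^2).

    Hypothetical helper: constants and log factors hidden by the
    Omega notation are ignored, so this is a scale, not a sample count.
    """
    if c_min < 0 or eps <= 0:
        raise ValueError("c_min must be >= 0 and eps must be > 0")
    if c_min == 0:
        # Assumption for illustration: the expression diverges as
        # c_min -> 0, matching the remark that the problem may not
        # be learnable in that case.
        return math.inf
    return S * A * B_star**3 / (c_min * eps**2)


if __name__ == "__main__":
    # Shrinking the minimum cost inflates the bound proportionally.
    for c_min in (0.5, 0.25, 0.0):
        print(c_min, ssp_lower_bound_scale(10, 5, 2.0, c_min, 0.5))
```

The cubic dependence on $B_{\star}$ and the inverse dependence on $c_{\min}$ are what distinguish this expression from the finite-horizon and discounted settings mentioned in the abstract.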

Related papers