Fugu-MT Paper Translation (Abstract): Toward the Fundamental Limits of Imitation Learning

Paper overview: Toward the Fundamental Limits of Imitation Learning

arxiv url: http://arxiv.org/abs/2009.05990v1
Date: Sun, 13 Sep 2020 12:45:52 GMT
Status: Translation complete
Last updated in system: 2022-10-19 02:34:55.518290
Title: Toward the Fundamental Limits of Imitation Learning
Authors: Nived Rajaraman, Lin F. Yang, Jiantao Jiao, Kannan Ramachandran
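Abstract summary: Imitation learning (IL) aims to mimic the behavior of an expert policy in a sequential decision-making problem given only demonstrations. The paper first considers the setting where the learner is provided a dataset of $N$ expert trajectories ahead of time and cannot interact with the MDP. It shows that the policy which mimics the expert whenever possible is $\lesssim \frac{|\mathcal{S}| H^2 \log (N)}{N}$ suboptimal compared to the value of the expert, even when the expert follows an arbitrary stochastic policy.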
Reference score (attention level computed independently by this site): 29.87139380803829
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Imitation learning (IL) aims to mimic the behavior of an expert policy in a sequential decision-making problem given only demonstrations. In this paper, we focus on understanding the minimax statistical limits of IL in episodic Markov Decision Processes (MDPs). We first consider the setting where the learner is provided a dataset of $N$ expert trajectories ahead of time, and cannot interact with the MDP. Here, we show that the policy which mimics the expert whenever possible is in expectation $\lesssim \frac{|\mathcal{S}| H^2 \log (N)}{N}$ suboptimal compared to the value of the expert, even when the expert follows an arbitrary stochastic policy. Here $\mathcal{S}$ is the state space, and $H$ is the length of the episode. Furthermore, we establish a suboptimality lower bound of $\gtrsim |\mathcal{S}| H^2 / N$ which applies even if the expert is constrained to be deterministic, or if the learner is allowed to actively query the expert at visited states while interacting with the MDP for $N$ episodes. To our knowledge, this is the first algorithm with suboptimality having no dependence on the number of actions, under no additional assumptions. We then propose a novel algorithm based on minimum-distance functionals in the setting where the transition model is given and the expert is deterministic. The algorithm is suboptimal by $\lesssim \min \{ H \sqrt{|\mathcal{S}| / N} ,\ |\mathcal{S}| H^{3/2} / N \}$, showing that knowledge of transition improves the minimax rate by at least a $\sqrt{H}$ factor.
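The abstract refers to "the policy which mimics the expert whenever possible" without spelling out a construction. The following is a minimal tabular sketch of one such policy, not the authors' implementation. It assumes each trajectory is a list of (state, action) pairs indexed by time step, and it assumes a uniform random fallback at (step, state) pairs that never appear in the data, since the abstract does not say what the policy does there. The names `fit_mimic_policy` and `act` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_mimic_policy(trajectories):
    """Count expert actions at every (step, state) pair seen in the dataset."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for step, (state, action) in enumerate(trajectory):
            counts[(step, state)][action] += 1
    return counts


def act(counts, step, state, actions, rng=random):
    """Choose an action for the given time step and state.

    If (step, state) was observed, sample from the expert's empirical
    action distribution there. Otherwise fall back to a uniform choice
    over `actions` (an assumption of this sketch).
    """
    seen = counts.get((step, state))
    if not seen:
        return rng.choice(actions)
    observed_actions = list(seen.keys())
    weights = list(seen.values())
    return rng.choices(observed_actions, weights=weights, k=1)[0]


# Two toy expert trajectories of length H = 2 (made-up data for illustration).
demos = [
    [("s0", "left"), ("s1", "right")],
    [("s0", "left"), ("s2", "right")],
]
policy = fit_mimic_policy(demos)
print(act(policy, 0, "s0", ["left", "right"]))
```

With a deterministic expert, every observed (step, state) pair has a single recorded action, so the sampling step reduces to replaying that action. Note that the learned table has no dependence on the size of the action set except through the fallback, which is in the spirit of the abstract's remark that the bound does not depend on the number of actions.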
