Fugu-MT Paper Translation (Abstract): Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles

Paper abstract: Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles

arxiv url: http://arxiv.org/abs/2309.09457v2
Date: Tue, 19 Sep 2023 01:56:24 GMT
Status: Translation complete
Last updated in system: 2023-09-20 10:58:11.359376
Title: Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles
Title (reference translation): Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles
Authors: Noah Golowich and Ankur Moitra and Dhruv Rohatgi
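Abstract summary: This paper revisits linear MDPs from the perspective of feature selection. The main result is the first polynomial-time algorithm for learning a near-optimal policy in a $k$-sparse linear MDP. Along the way, it shows that polynomial-sized emulators exist and can be computed efficiently via convex programming.
Reference score (independently computed attention measure): 39.10180309328293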
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map $\phi(x, a)$ that maps state-action pairs to $d$-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the "kitchen sink" approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a $k$-sparse linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ containing all the relevant features, and the goal is to learn a near-optimal policy in only $\mathrm{poly}(k, \log d)$ interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist and can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible.
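Abstract (reference translation): The basic assumption behind linear Markov Decision Processes (MDPs) is that the learner can access a known feature map $\phi(x, a)$ that sends state-action pairs to $d$-dimensional vectors, and that rewards and transitions are linear functions of this representation. But where do these features come from? Without expert domain knowledge, a tempting strategy is to take the "kitchen sink" approach and hope that the true features are contained in a much larger set of candidate features. This paper re-examines linear MDPs from the perspective of feature selection. In a $k$-sparse linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ that contains all the relevant features, and the goal is to learn a near-optimal policy using only $\mathrm{poly}(k, \log d)$ interactions with the environment. The main result is the first polynomial-time algorithm for this problem. By contrast, earlier work either made prohibitively strong assumptions that removed the need for exploration, or required solving computationally intractable optimization problems. Along the way, the paper introduces the notion of an emulator: a succinct approximate representation of the transitions that is sufficient for computing certain Bellman backups. Because linear MDPs are a non-parametric model, it is not even obvious that polynomial-sized emulators exist. The paper shows that they do exist and can be computed efficiently via convex programming. As a corollary of the main result, it gives an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and uses a polynomial number of samples. This can be viewed as a reinforcement learning analogue of classic results in computational learning theory. It also provides a natural model in which improving sample complexity through representation learning is computationally feasible.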
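
To make the $k$-sparse structure described in the abstract concrete, here is a minimal toy sketch. It is not the paper's algorithm, and the paper's exact formalization may differ. It assumes the standard linear MDP form, in which transitions are $\langle \phi(x, a), \mu(x') \rangle$ and rewards are $\langle \phi(x, a), \theta \rangle$, with $\mu$ and $\theta$ supported only on the hidden set $S$. The function name `make_sparse_linear_mdp` and all parameters are hypothetical.

```python
import numpy as np

def make_sparse_linear_mdp(n_states=5, n_actions=2, d=50, k=3, seed=0):
    """Build a toy tabular MDP whose dynamics depend on only k of d features.

    Illustrative assumption: the relevant feature coordinates form a
    probability vector, so the induced transitions are valid distributions.
    """
    rng = np.random.default_rng(seed)

    # Hidden relevant subset S of [d]; a learner would not be told this.
    S = np.sort(rng.choice(d, size=k, replace=False))

    # "Kitchen sink" features: most coordinates are irrelevant noise.
    phi = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, d))
    # Relevant coordinates: a distribution over k mixture components.
    phi[:, :, S] = rng.dirichlet(np.ones(k), size=(n_states, n_actions))

    # mu and theta are zero outside S, so irrelevant features never matter.
    mu = np.zeros((d, n_states))
    mu[S] = rng.dirichlet(np.ones(n_states), size=k)
    theta = np.zeros(d)
    theta[S] = rng.uniform(0.0, 1.0, size=k)

    P = phi @ mu      # P[x, a, x'] = <phi(x, a), mu(x')>
    r = phi @ theta   # r[x, a]     = <phi(x, a), theta>
    return S, phi, mu, theta, P, r

S, phi, mu, theta, P, r = make_sparse_linear_mdp()
# The full d-dimensional model and the k-dimensional restriction agree.
assert np.allclose(P, phi[:, :, S] @ mu[S])
assert np.allclose(r, phi[:, :, S] @ theta[S])
```

In this toy model the learner sees all $d$ feature coordinates but only the $k$ in `S` affect rewards and transitions, which is why a sample complexity of $\mathrm{poly}(k, \log d)$ rather than $\mathrm{poly}(d)$ is the natural target.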
