Fugu-MT 論文翻訳(概要): Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs

論文の概要: Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs

arxiv url: http://arxiv.org/abs/2303.10165v2
Date: Wed, 14 Feb 2024 17:44:08 GMT
ステータス: 翻訳完了
システム内更新日: 2024-02-15 20:09:21.710539
Title: Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs
Title（参考訳）: 線形混合mdpのための最適地平線なし報酬フリー探索法
Authors: Junkai Zhang and Weitong Zhang and Quanquan Gu
Abstract要約: 線形マルコフ決定過程(MDP)を学習するための新たな報酬なしアルゴリズムを提案する。我々のアルゴリズムの核心は、探索駆動の擬似回帰を用いた不確実性重み付き値目標回帰である。我々のアルゴリズムは$tilde O(d2varepsilon-2)$ episodesを探索するだけで、$varepsilon$-optimal policyを見つけることができる。
参考スコア（独自算出の注目度）: 60.40452803295326
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study reward-free reinforcement learning (RL) with linear function approximation, where the agent works in two phases: (1) in the exploration phase, the agent interacts with the environment but cannot access the reward; and (2) in the planning phase, the agent is given a reward function and is expected to find a near-optimal policy based on samples collected in the exploration phase. The sample complexities of existing reward-free algorithms have a polynomial dependence on the planning horizon, which makes them intractable for long planning horizon RL problems. In this paper, we propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs), where the transition probability can be parameterized as a linear combination of known feature mappings. At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward and a high-order moment estimator for the aleatoric and epistemic uncertainties. When the total reward is bounded by $1$, we show that our algorithm only needs to explore $\tilde O( d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy, where $d$ is the dimension of the feature mapping. The sample complexity of our algorithm only has a polylogarithmic dependence on the planning horizon and therefore is "horizon-free". In addition, we provide an $\Omega(d^2\varepsilon^{-2})$ sample complexity lower bound, which matches the sample complexity of our algorithm up to logarithmic factors, suggesting that our algorithm is optimal.
Abstract（参考訳）: そこでは,(1)探索段階では,エージェントは環境と相互作用するが報酬にはアクセスできない,(2)計画段階では報酬関数が与えられ,探索段階で収集されたサンプルに基づいて,ほぼ最適方針が求められる,という2つのフェーズでエージェントが機能する線形関数近似を用いた報酬不要強化学習(RL)について検討する。既存の報酬のないアルゴリズムのサンプルの複雑さは計画の地平線に依存するため、長期計画の地平線rl問題では役に立たない。本稿では,線形混合マルコフ決定過程(MDP)を学習するための新たな報奨自由アルゴリズムを提案し,その遷移確率を既知の特徴写像の線形結合としてパラメータ化する。提案アルゴリズムのコアとなるのは,探索駆動擬似回帰による不確実性重み付き値目標回帰と,アレタリックおよびエピステマティック不確実性に対する高次モーメント推定器である。合計報酬が$$$に制限されている場合、我々のアルゴリズムは$\tilde O(d^2\varepsilon^{-2})$のエピソードを探索するだけで、$\varepsilon$-Optimal Policyを見つけることができる。このアルゴリズムのサンプル複雑性は計画の地平線に多対数依存性しか持たず、従って「ホライゾンフリー」である。さらに, アルゴリズムのサンプル複雑性を対数因子に合わせることで, アルゴリズムが最適であることを示す,$\Omega(d^2\varepsilon^{-2})$ sample complexity lower boundを提供する。

関連論文リスト

Policy Mirror Descent Inherently Explores Action Space [10.772560347950053]
1/epsilon2)$tildemathcalO($tildemathcalO)を,探索戦略を使わずに,オンラインポリシグラデーションメソッドのサンプル複雑性を初めて確立しました。新しいポリシー勾配法は、最適なポリシーを探す際に、潜在的にリスクの高い行動に繰り返しコミットするのを防ぐことができる。
論文参考訳（メタデータ） (2023-03-08T05:19:08Z)
Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation [16.871660060209674]
本研究では, 線形関数近似を用いた展開効率向上強化学習(RL)の課題を, 遠近自由探索条件下で検討する。我々は,最大$widetildeO(fracd2H5epsilon2)$ trajectoriesを$H$デプロイメント内で収集し,$epsilon$-Optimal Policyを任意の(おそらくはデータに依存した)報酬関数の選択に対して識別するアルゴリズムを提案する。
論文参考訳（メタデータ） (2022-10-03T03:48:26Z)
Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) は、エージェントが探索中に報酬関数にアクセスできないような環境を考える。この分離は線形MDPの設定には存在しないことを示す。我々は$d$次元線形 MDP における報酬のない RL に対する計算効率の良いアルゴリズムを開発した。
論文参考訳（メタデータ） (2022-01-26T22:09:59Z)
On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
エージェントが事前に特定された報酬関数を使わずに環境を徹底的に探索することを目的とした報酬のないRL問題について検討する。関数近似の文脈でこの問題に取り組み、強力な関数近似器を活用する。我々は、カーネルとニューラルファンクション近似器を用いた、証明可能な効率の良い報酬なしRLアルゴリズムを確立した。
論文参考訳（メタデータ） (2021-10-19T07:26:33Z)
Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation [92.99933928528797]
エピソードマルコフ決定過程(MDP)に対する線形関数近似を用いたモデルに基づく無報酬強化学習について検討する。計画段階では、特定の報酬関数が与えられ、探索フェーズから収集したサンプルを使用して良い政策を学ぶ。任意の報酬関数に対して$epsilon$-optimal Policyを得るには,最大$tilde O(H4d(H + d)epsilon-2)$ episodesをサンプリングする必要がある。
論文参考訳（メタデータ） (2021-10-12T23:03:58Z)
On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
Reward-free reinforcement learning (RL) は、バッチRL設定と多くの報酬関数がある設定の両方に適したフレームワークである。本研究では,線形関数近似を用いた報酬のないRLに対して,正と負の両方の結果を与える。
論文参考訳（メタデータ） (2020-06-19T17:59:36Z)
Adaptive Reward-Free Exploration [48.98199700043158]
提案アルゴリズムは1994年からのFiechterのアルゴリズムの変種と見なすことができる。さらに、報酬のない探索と最高の政治識別の相対的な複雑さについて検討する。
論文参考訳（メタデータ） (2020-06-11T09:58:03Z)
Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
探索の課題を分離する「逆フリーなRL」フレームワークを提案する。我々は,$tildemathcalO(S2Amathrmpoly(H)/epsilon2)$の探索を効率的に行うアルゴリズムを提案する。また、ほぼ一致する$Omega(S2AH2/epsilon2)$ lower boundを与え、この設定でアルゴリズムのほぼ最適性を示す。
論文参考訳（メタデータ） (2020-02-07T14:03:38Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。