Fugu-MT 論文翻訳(概要): Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

論文の概要: Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

arxiv url: http://arxiv.org/abs/2207.11126v1
Date: Fri, 22 Jul 2022 15:00:15 GMT
ステータス: 翻訳完了
システム内更新日: 2022-07-25 13:34:13.760860
Title: Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP
Title（参考訳）: 文脈に面した最適性:確率的文脈 MDP に対する規則的保証
Authors: Orin Levy and Yishay Mansour
Abstract要約: 我々は,最小到達可能性仮定の下での文脈的MDPに対する後悔のアルゴリズムを提案する。我々のアプローチは、一般関数近似を用いた文脈的MDPに適用された最初の楽観的アプローチである。
参考スコア（独自算出の注目度）: 46.86114958340962
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present regret minimization algorithms for stochastic contextual MDPs under minimum reachability assumption, using an access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains $ \tilde{O}\left( \max\{H,{1}/{p_{min}}\}H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{F}|,|\mathcal{P}|\}/\delta)} \right)$ regret bound, with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{F}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear and etc.). In addition, we present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{F}|/|S|)/\ln(|A|)})$, on the expected regret which holds even in the case of known dynamics.
Abstract（参考訳）: 我々は,最小到達可能性仮定の下での確率的文脈的MDPに対する最小化アルゴリズムを,オフライン最小二乗回帰オラクルへのアクセスを用いて提案する。ダイナミクスが分かっている場所、ダイナミクスが未知だがコンテキストに依存しない場所、そしてダイナミクスが未知でコンテキスト依存である最も困難な設定の3つの設定を分析します。後者について、アルゴリズムは$ \tilde{o}\left( \max\{h,{1}/{p_{min}}\}h|s|^{3/2}\sqrt{|a|t\log(\max\{|\mathcal{f}|,|\mathcal{p}|\}/\delta)} \right)$ regret bound, with probability $1-\delta$, where $\mathcal{p}$ and $\mathcal{f}$ is finite and realizable function class using the dynamics and rewards, $p_{min}$ is the minimum reachability parameter, $s$ is the set of actions, $a$ the horizon, $h$ the $t$ the episodes of episodes, $t$ the number of episodes を得る。我々の知る限り、我々のアプローチは一般関数近似を用いた文脈的 MDP に適用された最初の楽観的なアプローチである(すなわち、線形であるような関数クラスに関する追加の知識がない)。さらに、既知の力学の場合でさえ成り立つ期待された後悔に対して、$\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{F}|/|S|)/\ln(|A|)})$の低い境界を示す。

関連論文リスト

Online Newton Method for Bandit Convex Optimisation [28.66596225688161]
ゼロ階帯域幅の最適化のための計算効率の良いアルゴリズムを提案する。逆条件では、その後悔は少なくとも$d3.5 sqrtn Mathrmpolylog(n, d)$であり、d$が時間的地平線である確率が高いことを証明している。設定において、バウンダリは$M d2 sqrtn Mathrmpolylog(n, d)$に改善され、[d-1/2, d-1 / 4]$は$Mとなる。
論文参考訳（メタデータ） (2024-06-10T17:44:11Z)
Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
我々のアルゴリズムが $tildeObig((d+log (|mathcalS|2 |mathcalA|))sqrtKbig)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping is linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, $|mathcalS|$ and $|mathcalA|$ is the standardities of the state and action space。
論文参考訳（メタデータ） (2023-05-15T05:37:32Z)
Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation [47.18926328995424]
我々は,敵対的文脈 MDP における後悔最小化のためのOMG-CMDP!アルゴリズムを提案する。我々のアルゴリズムは効率的であり(効率的なオンライン回帰オラクルを仮定する)、近似誤差に対して単純で堅牢である。
論文参考訳（メタデータ） (2023-03-02T18:27:00Z)
Eluder-based Regret for Stochastic Contextual MDPs [43.19667415823089]
文脈マルコフ決定過程(CMDP)における後悔最小化のためのE-UC$3$RLアルゴリズムを提案する。我々のアルゴリズムは効率的であり(効率的なオフライン回帰オラクルを仮定すると)、$ widetildeO(H3 sqrtT |S| |A|d_mathrmE(mathcalP)$の後悔の保証を享受する。
論文参考訳（メタデータ） (2022-11-27T20:38:47Z)
On Submodular Contextual Bandits [92.45432756301231]
作用が基底集合の部分集合であり、平均報酬が未知の単調部分モジュラ函数によってモデル化されるような文脈的包帯の問題を考える。 Inverse Gap Weighting 戦略により,提案アルゴリズムは推定関数の局所的最適度を効率よくランダム化することを示す。
論文参考訳（メタデータ） (2021-12-03T21:42:33Z)
Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
状態空間 $mathcalS$ を用いたエピソードマルコフ決定過程 (MDPs) における模擬学習の統計的限界について検討する。 rajaraman et al (2020) におけるmdアルゴリズムを用いた準最適性に対する上限 $o(|mathcals|h3/2/n)$ を定式化する。 Omega(H3/2/N)$ $mathcalS|geq 3$ であるのに対して、未知の遷移条件はよりシャープレートに悩まされる。
論文参考訳（メタデータ） (2021-02-25T15:50:19Z)
Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
本稿では、特にバッチ強化学習に適した報酬不要強化学習フレームワークと、複数の報酬関数に対するポリシーを必要とするシナリオについて検討する。 textbfStaged textbfSampling + textbfTruncated textbfPlanning (algoname) という新しい効率的なアルゴリズムを提供しています。
論文参考訳（メタデータ） (2020-10-12T17:51:19Z)
Toward the Fundamental Limits of Imitation Learning [29.87139380803829]
シミュレーション学習(IL)は、実演のみを与えられた逐次的な意思決定問題において、専門家の政策の振る舞いを模倣することを目的としている。まず,学習者が事前に$N$のエキスパートトラジェクトリのデータセットを提供して,MDPと対話できないような設定について検討する。可能な限り専門家を模倣するポリシーは、専門家が任意のポリシーに従う場合でも、専門家の値と比較すると、$lesssim frac|mathcalS| H2 log (N)N$ suboptimalであることを示す。
論文参考訳（メタデータ） (2020-09-13T12:45:52Z)
$Q$-learning with Logarithmic Regret [60.24952657636464]
楽観的な$Q$は$mathcalOleft(fracSAcdot mathrmpolyleft(Hright)Delta_minlogleft(SATright)right)$ cumulative regret bound, where $S$ is the number of state, $A$ is the number of action, $H$ is the planning horizon, $T$ is the total number of steps, $Delta_min$ is the least sub-Optitimality gap。
論文参考訳（メタデータ） (2020-06-16T13:01:33Z)
Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
我々は,損失予測の助けを借りて,文脈的包帯における学習を研究する。最適な後悔は$mathcalO(minsqrtT, sqrtmathcalETfrac13)$である。
論文参考訳（メタデータ） (2020-03-04T07:36:38Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。