Fugu-MT 論文翻訳(概要): Horizon-Free Reinforcement Learning for Latent Markov Decision Processes

論文の概要: Horizon-Free Reinforcement Learning for Latent Markov Decision Processes

arxiv url: http://arxiv.org/abs/2210.11604v1
Date: Thu, 20 Oct 2022 21:32:01 GMT
ステータス: 翻訳完了
システム内更新日: 2022-10-24 15:28:17.741084
Title: Horizon-Free Reinforcement Learning for Latent Markov Decision Processes
Title（参考訳）: 潜在マルコフ決定過程に対する水平自由強化学習
Authors: Runlong Zhou, Ruosong Wang, Simon S. Du
Abstract要約: 我々は,後期マルコフ決定過程(LMDP)における強化学習(RL)の文脈を考慮した後悔の最小化について検討した。我々は,モデル最適化と値最適化の両手法でインスタンス化できる,新しいモデルベースアルゴリズムフレームワークを設計する。
参考スコア（独自算出の注目度）: 62.90204655228324
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\widetilde{O}\left(\sqrt{M \Gamma S A K}\right)$ regret bound where $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\Gamma \le S$ is the maximum transition degree of any state-action pair. The regret bound only scales logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDP. Key in our proof is an analysis of the total variance of alpha vectors, which is carefully bounded by a recursion-based technique. We complement our positive result with a novel $\Omega\left(\sqrt{M S A K}\right)$ regret lower bound with $\Gamma = 2$, which shows our upper bound minimax optimal when $\Gamma$ is a constant. Our lower bound relies on new constructions of hard instances and an argument based on the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proof for MDPs, and thus can be of independent interest.
Abstract（参考訳）: 潜在マルコフ決定過程(lmdps)における強化学習(rl)に対する後悔の最小化について検討した。我々は,モデル最適化と値最適化の両手法でインスタンス化できる新しいモデルベースアルゴリズムフレームワークを設計する。我々は$\widetilde{O}\left(\sqrt{M \Gamma S A K}\right)$ regret bound where $M$ is the number of contexts, $S$ is the number of state, $A$ is the number of actions, $K$ is the number of episodes, $\Gamma \le S$ is the maximum transition degree of any state-action pair。後悔のバウンドは計画の地平線と対数的にしかスケールしないので、lmdpに対して最初の(ほぼ)地平線なしの後悔となる。証明の鍵となるのは、再帰に基づく手法によって慎重に拘束されるアルファベクトルの総分散の分析である。我々は、新しい $\omega\left(\sqrt{m s a k}\right)$ regret lower bound with $\gamma = 2$ で正の結果を補完する。我々の下位境界は、理論計算機科学の対称性技術に基づく新しいハードインスタンスの構成と引数に依存しており、どちらも既存のMDPの下位境界証明と技術的に異なるため、独立した関心を持つことができる。

関連論文リスト

Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
非線型関数近似を用いたモデルに基づく強化学習について検討する。本研究では,無限水平平均逆法と割引逆法の両方に有効である確率効率のよい値反復型アルゴリズムを開発した。
論文参考訳（メタデータ） (2024-06-19T15:29:14Z)
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
本稿では,円滑なベルマン作用素を持つ連続空間マルコフ決定過程(MDP)の一般クラスにおいて,$varepsilon$-optimal Policyを学習する問題を考察する。我々のソリューションの鍵となるのは、調和解析のアイデアに基づく新しい射影技術である。我々の結果は、連続空間 MDP における2つの人気と矛盾する視点のギャップを埋めるものである。
論文参考訳（メタデータ） (2024-05-10T09:58:47Z)
A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
エピソードスパークリニアマルコフ決定過程(SMDP)に対する新しい後悔アルゴリズムを提案する。提案アルゴリズムは$tildeO(sigma-1_min s_star H sqrtN)$である。
論文参考訳（メタデータ） (2023-10-23T18:52:17Z)
Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
バーンインコストを発生させることなく、最小限の最適後悔を実現する方法を示す。最適値/コストや一定の分散といった問題依存量の影響を明らかにするために、我々の理論を拡張します。
論文参考訳（メタデータ） (2023-07-25T15:42:11Z)
Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation [25.60689712525918]
本稿では,遷移確率と報酬関数が線形な線形関数近似を用いた強化学習について検討する。本稿では,新たなアルゴリズムLSVI-UCB$+$を提案し,$H$がエピソード長,$d$が特徴次元,$T$がステップ数である場合に,$widetildeO(HdsqrtT)$ regretboundを実現する。
論文参考訳（メタデータ） (2022-06-23T06:04:21Z)
Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs [12.450760567361531]
オンライン学習問題では,低分散の活用がパフォーマンス保証の厳密化に重要な役割を果たしている。本研究は, 後悔の限界を著しく改善する新たな分析法を提案する。我々の分析は、新しい楕円型ポテンシャル数補題に依存している。
論文参考訳（メタデータ） (2021-11-05T06:47:27Z)
Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
線形バンドイットと線形混合決定プロセス(mdp)に対する分散認識信頼セットの構築方法を示す。線形バンドイットに対しては、$d を特徴次元とする$widetildeo(mathrmpoly(d)sqrt1 + sum_i=1ksigma_i2) が成り立つ。線形混合 MDP に対し、$widetildeO(mathrmpoly(d)sqrtK)$ regret bound を得る。
論文参考訳（メタデータ） (2021-01-29T18:57:52Z)
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
本稿では,マルコフ決定過程(MDP)の遷移確率核が線形混合モデルである線形関数近似による強化学習について検討する。上記の線形混合 MDP に対して$textUCRL-VTR+$ という線形関数近似を用いた計算効率の良い新しいアルゴリズムを提案する。我々の知る限り、これらは線形関数近似を持つRLのための計算効率が良く、ほぼ最小のアルゴリズムである。
論文参考訳（メタデータ） (2020-12-15T18:56:46Z)
$Q$-learning with Logarithmic Regret [60.24952657636464]
楽観的な$Q$は$mathcalOleft(fracSAcdot mathrmpolyleft(Hright)Delta_minlogleft(SATright)right)$ cumulative regret bound, where $S$ is the number of state, $A$ is the number of action, $H$ is the planning horizon, $T$ is the total number of steps, $Delta_min$ is the least sub-Optitimality gap。
論文参考訳（メタデータ） (2020-06-16T13:01:33Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。