Fugu-MT paper translation (abstract): Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Paper summary: Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

arxiv url: http://arxiv.org/abs/2603.23926v1
Date: Wed, 25 Mar 2026 04:34:19 GMT
Status: translation complete
Last updated in system: 2026-03-26 21:06:11.129292
Title: Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs
Title (reference translation): Optimal variance-dependent regret bounds for infinite-horizon MDPs
Authors: Guy Zamir, Matthew Zurek, Yudong Chen
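Abstract summary: Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) is less developed, both theoretically and algorithmically, than its episodic counterpart. This work addresses the resulting shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $γ$-regret. It develops a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees.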
Reference score (independently computed attention score): 8.923988278588768
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high ``burn-in'' costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $γ$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $γ$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.
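Abstract (reference translation): Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less developed, both theoretically and algorithmically, than its episodic counterpart; many algorithms suffer from high "burn-in" costs and fail to adapt to benign instance-specific complexity. This work addresses these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $γ$-regret. It develops a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. The regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $γ$-regret bounds in the worst case, while also adapting to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, the algorithm has significantly improved lower-order terms in the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, it obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which is proved to be optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, the authors prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and they provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap between what is achievable with and without prior knowledge.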
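To make the leading term $\sqrt{SA\,\text{Var}}$ more concrete, the following is a minimal sketch under stated assumptions, not the paper's algorithm or its exact definition of $\text{Var}$. It assumes that "cumulative transition variance" can be read as the sum, over visited state-action pairs, of the one-step variance of a value-like vector `h` under the next-state distribution. The names `transition_variance`, `cumulative_variance`, `leading_term`, the vector `h`, and the two toy MDPs are all hypothetical.

```python
import math

def transition_variance(p_next, h):
    """Variance of h(s') when the next state s' is drawn from p_next."""
    mean = sum(p * h[s] for s, p in enumerate(p_next))
    return sum(p * (h[s] - mean) ** 2 for s, p in enumerate(p_next))

def cumulative_variance(P, h, trajectory):
    """Sum of one-step transition variances over visited (state, action) pairs.

    Assumption: this stands in for the abstract's Var; the paper may define it differently.
    """
    return sum(transition_variance(P[s][a], h) for s, a in trajectory)

def leading_term(S, A, var):
    """sqrt(S * A * Var), ignoring constants and logarithmic factors."""
    return math.sqrt(S * A * var)

# Toy setup: two states, one action; h is a hypothetical value-like vector.
h = [0.0, 1.0]
P_det = [[[0.0, 1.0]], [[1.0, 0.0]]]   # deterministic cycle 0 -> 1 -> 0
P_rand = [[[0.5, 0.5]], [[0.5, 0.5]]]  # next state uniformly random
trajectory = [(t % 2, 0) for t in range(1000)]

var_det = cumulative_variance(P_det, h, trajectory)
var_rand = cumulative_variance(P_rand, h, trajectory)

print("deterministic:", var_det, leading_term(2, 1, var_det))
print("random:", var_rand, leading_term(2, 1, var_rand))
```

In this toy reading, the deterministic MDP has zero cumulative variance, so the leading term vanishes and only lower-order terms would remain, which is consistent with the abstract's remark about nearly constant regret in deterministic MDPs. In the random MDP the variance grows linearly with the number of steps, so the leading term grows like the square root of the horizon.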
