Fugu-MT 論文翻訳(概要): Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

論文の概要: Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

arxiv url: http://arxiv.org/abs/2604.17457v2
Date: Tue, 21 Apr 2026 11:49:02 GMT
ステータス: 翻訳完了
システム内更新日: 2026-04-22 14:04:47.923263
Title: Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration
Title（参考訳）: ベルマン固定点を超えて: 値反復における幾何と高速なポリシー同定
Authors: Donghwan Lee,
Abstract要約: 本稿では,スイッチングシステム理論のレンズを通して,ディスカウントされたQ-VIを再検討する。 Q-VI は一般に有限時間において$Q*$に到達しないが、有限時間において最適な作用クラスを特定する。
参考スコア（独自算出の注目度）: 7.8232617281369805
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Dynamic programming is one of the most fundamental methodologies for solving Markov decision problems. Among its many variants, Q-value iteration (Q-VI) is particularly important due to its conceptual simplicity and its classical contraction-based convergence guarantee. Despite the central role of this contraction property, it does not fully reveal the geometric structure of the Q-VI trajectory. In particular, when one is interested not only in the final limit $Q^*$ but also in when the induced greedy policy becomes effectively optimal, the standard contraction argument provides only a coarse characterization. To formalize this notion, we denote by $\mathcal X^*$ the set of $Q$-functions whose corresponding tie-broken greedy policies are optimal, referred to as the practically optimal solution set (POS). In this paper, we revisit discounted Q-VI through the lens of switching system theory and derive new geometric insights into its behavior. In particular, we show that although Q-VI does not reach $Q^*$ in finite time in general, it identifies the optimal action class in finite time. Furthermore, we prove that the distance from the iterate to a particular subset of $\mathcal X^*$ decays exponentially at a rate governed by the joint spectral radius (JSR) of a restricted switching family. This rate can be strictly faster than the standard $γ$ rate when the restricted JSR is strictly smaller than $γ$, while the convergence of the entire $Q$-function to $Q^*$ can still be dominated by the slower $γ$ mode, where $γ$ denotes the discount factor. These results reveal a two-stage geometric behavior of Q-VI: a fast convergence toward $\mathcal X_1$, followed by a slower convergence toward $Q^*$ in general.
Abstract（参考訳）: 動的プログラミングはマルコフ決定問題を解くための最も基本的な方法論の1つである。多くの変種の中で、Q-値反復(Q-VI)は概念的単純さと古典的収縮に基づく収束を保証するため特に重要である。この収縮特性の中心的な役割にもかかわらず、Q-VI軌道の幾何学的構造を完全には明らかにしていない。特に、最終極限$Q^*$だけでなく、誘導された欲求ポリシーが効果的に最適になったときにも、標準収縮論証は粗い特徴を与えるだけである。この概念を定式化するには、$\mathcal X^*$ で、対応する階層的欲求ポリシーが最適である$Q$-函数の集合を、実際最適解集合 (POS) と呼ぶ。本稿では, スイッチングシステム理論のレンズを通して, Q-VI の割引を再度検討し, その挙動に関する新たな幾何学的洞察を導出する。特に、Q-VI は一般に有限時間において$Q^*$に到達しないが、有限時間において最適な作用クラスを特定する。さらに、イテレートから$\mathcal X^*$の特定の部分集合への距離は、制限されたスイッチングファミリーの結合スペクトル半径(JSR)に支配される速度で指数関数的に崩壊することを示す。制限されたJSRが$γ$よりも厳密に小さい場合、このレートは標準の$γ$よりも厳格に速いが、$Q$関数全体の$Q^*$への収束は、さらに遅い$γ$モードによって支配され、$γ$は割引係数を表す。これらの結果は、Q-VIの2段階の幾何学的挙動を示す:$\mathcal X_1$への高速収束と、一般に$Q^*$への緩やかな収束である。

論文の概要: Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

関連論文リスト