Fugu-MT paper translation (abstract): Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains

Paper overview: Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains

arxiv url: http://arxiv.org/abs/2510.25514v1
Date: Wed, 29 Oct 2025 13:38:24 GMT
Status: translation complete
Last updated in system: 2025-10-30 15:50:45.59729
Title: Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains
Authors: Maik Overmars, Jasper Goseling, Richard Boucherie
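Abstract summary: We study the convergence of off-policy TD(0) with linear function approximation when used to approximate the expected discounted reward in a Markov chain. Our approach is to analyse the standard algorithm, but to restrict our attention to the class of reversible Markov chains.
Reference score (independently computed attention measure): 0.17478203318226307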
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the convergence of off-policy TD(0) with linear function approximation when used to approximate the expected discounted reward in a Markov chain. It is well known that the combination of off-policy learning and function approximation can lead to divergence of the algorithm. Existing results for this setting modify the algorithm, for instance by reweighting the updates using importance sampling. This establishes convergence at the expense of additional complexity. In contrast, our approach is to analyse the standard algorithm, but to restrict our attention to the class of reversible Markov chains. We demonstrate convergence under this mild reversibility condition on the structure of the chain, which in many applications can be assumed using domain knowledge. In particular, we establish a convergence guarantee under an upper bound on the discount factor in terms of the difference between the on-policy and off-policy process. By establishing an explicit bound, this improves upon known results in the literature, which state only that convergence holds for a sufficiently small discount factor. Convergence is with probability one and achieves projected Bellman error equal to zero. To obtain these results, we adapt the stochastic approximation framework that was used by Tsitsiklis and Van Roy [1997] for the on-policy case to the off-policy case. We illustrate our results using different types of reversible Markov chains, such as one-dimensional random walks and random walks on a weighted graph.
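Illustrative sketch (not taken from the paper): the setting described in the abstract can be made concrete with a small example of standard TD(0) with linear function approximation, run off-policy on a one-dimensional reflecting random walk, which is a birth-death chain and therefore reversible. Everything specific here is an assumption chosen for illustration: the five-state chain, the two features, the reward r(s) = s, the discount factor 0.5, the step-size schedule, and in particular the off-policy model, in which the current state is drawn i.i.d. from a uniform behaviour distribution rather than from the chain's stationary distribution. The paper's precise off-policy model and its bound on the discount factor are not reproduced. The helper `fixed_point` solves the 2x2 linear system whose solution has zero projected Bellman error under the uniform weighting, so the stochastic iterate can be compared against it.

```python
import random

N_STATES = 5   # states 0..4 of a one-dimensional random walk (assumed example)
GAMMA = 0.5    # discount factor (illustrative value, not the paper's bound)
P_UP = 0.3     # probability of attempting a step to the right


def transition_probs(s):
    """Row s of the target chain P: a reflecting birth-death walk (reversible)."""
    up, down = min(s + 1, N_STATES - 1), max(s - 1, 0)
    probs = {}
    probs[up] = probs.get(up, 0.0) + P_UP
    probs[down] = probs.get(down, 0.0) + (1.0 - P_UP)
    return probs


def step(s, rng):
    """Sample the next state from row s of P."""
    if rng.random() < P_UP:
        return min(s + 1, N_STATES - 1)
    return max(s - 1, 0)


def features(s):
    """Two linear features: a constant and a centred, scaled position."""
    return (1.0, (s - 2) / 2.0)


def reward(s):
    return float(s)


def value(theta, s):
    phi = features(s)
    return theta[0] * phi[0] + theta[1] * phi[1]


def td0_off_policy(num_steps, seed=0):
    """Standard TD(0), no importance weights; states drawn uniformly (assumption)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for t in range(num_steps):
        s = rng.randrange(N_STATES)      # behaviour distribution: uniform
        s_next = step(s, rng)            # next state follows the target chain P
        delta = reward(s) + GAMMA * value(theta, s_next) - value(theta, s)
        alpha = 5.0 / (t + 50.0)         # diminishing step size
        phi = features(s)
        theta = [theta[i] + alpha * delta * phi[i] for i in range(2)]
    return theta


def fixed_point():
    """Solve A theta = b with A = Phi^T D (I - gamma P) Phi and b = Phi^T D r."""
    mu = 1.0 / N_STATES                  # uniform weighting D
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for s in range(N_STATES):
        phi = features(s)
        # Expected next-state features under P.
        exp_next = [0.0, 0.0]
        for s_next, p in transition_probs(s).items():
            phi_next = features(s_next)
            for j in range(2):
                exp_next[j] += p * phi_next[j]
        for i in range(2):
            b[i] += mu * phi[i] * reward(s)
            for j in range(2):
                A[i][j] += mu * phi[i] * (phi[j] - GAMMA * exp_next[j])
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    theta0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    theta1 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [theta0, theta1]
```

Under these assumptions, comparing `td0_off_policy(100_000)` with `fixed_point()` gives a quick empirical check that the unmodified algorithm settles near the zero-projected-Bellman-error solution on this particular chain. It says nothing about chains or discount factors outside the example.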

Related papers