Fugu-MT 論文翻訳(概要): A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

論文の概要: A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

arxiv url: http://arxiv.org/abs/2510.16132v1
Date: Fri, 17 Oct 2025 18:19:00 GMT
ステータス: 翻訳完了
システム内更新日: 2025-10-25 00:56:38.860952
Title: A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies
Title（参考訳）: 時変ポリシを用いたQ-Learningの最小評価分析
Authors: Phalguni Nanda, Zaiwei Chen,
Abstract要約: 時変学習ポリシーの下でQ-ラーニングアルゴリズムの最初の有限時間解析を行う。その結果、政治上のQ-ラーニングは、政治外のQ-ラーニングよりも探究が弱いことが判明した。数値シミュレーションは我々の理論を裏付ける。
参考スコア（独自算出の注目度）: 3.950802208390739
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we present the first finite-time analysis of the Q-learning algorithm under time-varying learning policies (i.e., on-policy sampling) with minimal assumptions -- specifically, assuming only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $O(1/\epsilon^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty] \le \epsilon$, matching that of off-policy Q-learning but with a worse dependence on exploration-related parameters. We also derive an explicit rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$. These results reveal that on-policy Q-learning exhibits weaker exploration than its off-policy counterpart but enjoys an exploitation advantage, as its policy converges to an optimal one rather than remaining fixed. Numerical simulations corroborate our theory. Technically, the combination of time-varying learning policies (which induce rapidly time-inhomogeneous Markovian noise) and the minimal assumption on exploration presents significant analytical challenges. To address these challenges, we employ a refined approach that leverages the Poisson equation to decompose the Markovian noise corresponding to the lazy transition matrix into a martingale-difference term and residual terms. To control the residual terms under time inhomogeneity, we perform a sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These tools may further facilitate the analysis of general reinforcement learning algorithms with rapidly time-varying learning policies -- such as single-timescale actor--critic methods and learning-in-games algorithms -- and are of independent interest.
Abstract（参考訳）: 本研究では,Q-ラーニングアルゴリズムを最小限の仮定で,Q-ラーニングアルゴリズムの最初の有限時間解析を行い,特に状態空間上の既約マルコフ連鎖を誘導するポリシーの存在を仮定する。我々は、$\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$に対して、$\mathbb{E}[\|Q_k - Q^*\|_\infty] \le \epsilon$を達成するために、$O(1/\epsilon^2)$のサンプル複雑性を暗示する。また、$\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$に対して明示的なレートを導出します。これらの結果は、オンラインのQ-ラーニングは、政治以外のものよりも探究が弱いが、その政策が固定されたままではなく最適なものに収束するので、搾取の利点を享受していることを示している。数値シミュレーションは我々の理論を裏付ける。技術的には、時間変化学習ポリシー(急激な時間不均一なマルコフノイズを誘発する)と探索に関する最小限の仮定の組み合わせは、重要な分析上の課題を示す。これらの課題に対処するため、ポアソン方程式を利用して遅延遷移行列に対応するマルコフ雑音をマルティンゲール差分項と残留項に分解する。時間的不均一性の下での残差項を制御するために、Q関数推定と学習ポリシーの両方に関してポアソン方程式解の感度解析を行う。これらのツールは、シングルタイムスケールアクターの方法やゲーム内の学習アルゴリズムなど、急速に変化する学習ポリシーを持つ一般的な強化学習アルゴリズムの分析をさらに促進し、独立した関心を持つ可能性がある。

論文の概要: A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

関連論文リスト