Fugu-MT paper translation (abstract): Sharp asymptotic theory for Q-learning with LDTZ learning rate and its generalization

Paper abstract: Sharp asymptotic theory for Q-learning with LDTZ learning rate and its generalization

arxiv url: http://arxiv.org/abs/2604.04218v1
Date: Sun, 05 Apr 2026 18:31:51 GMT
Status: Translation complete
System update date: 2026-04-07 15:49:18.988486
Title: Sharp asymptotic theory for Q-learning with LDTZ learning rate and its generalization
Title (reference translation): Sharp asymptotic theory for Q-learning with the LDTZ learning rate and its generalization
Authors: Soham Bonnerjee, Zhipeng Lou, Wei Biao Wu
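Abstract summary: The linear decay to zero schedule (\texttt{LD2Z}: $η_{t,n}=η(1-t/n)$), and more generally \texttt{PD2Z}-$ν$, achieves a best-of-both-worlds property. The paper also provides a novel time-uniform Gaussian approximation for the partial sum process of Q-learning iterates.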
Reference score (independently computed attention score): 1.7842332554022695
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the sustained popularity of Q-learning as a practical tool for policy determination, a majority of relevant theoretical literature deals with either constant ($η_{t}\equiv η$) or polynomially decaying ($η_{t} = ηt^{-α}$) learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (\texttt{LD2Z}: $η_{t,n}=η(1-t/n)$) schedule has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay to zero (\texttt{PD2Z}-$ν$: $η_{t,n}=η(1-t/n)^ν$). Proceeding step-by-step, we present a sharp non-asymptotic error bound for Q-learning with \texttt{PD2Z}-$ν$ schedule, which then is used to derive a central limit theory for a new \textit{tail} Polyak-Ruppert averaging estimator. Finally, we also provide a novel time-uniform Gaussian approximation (also known as \textit{strong invariance principle}) for the partial sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, our results definitively establish that \texttt{LD2Z} and in general \texttt{PD2Z}-$ν$ achieve a best-of-both-worlds property: they inherit the rapid decay from initialization (characteristic of constant step-sizes) while retaining the asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of \texttt{LD2Z} while providing practical guidelines for inference through our results.
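Abstract (reference translation): Although Q-learning remains popular as a practical tool for policy determination, most of the relevant theoretical literature treats either constant ($η_{t}\equiv η$) or polynomially decaying ($η_{t} = ηt^{-α}$) learning schedules. It is well known, however, that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (\texttt{LD2Z}: $η_{t,n}=η(1-t/n)$) schedule has shown appreciable empirical performance, yet its theoretical and statistical properties remain largely unexplored, particularly in the Q-learning setting. To close this gap, the authors first consider a general class of power-law decay to zero schedules (\texttt{PD2Z}-$ν$: $η_{t,n}=η(1-t/n)^ν$). Step by step, they establish a sharp non-asymptotic error bound for Q-learning with the \texttt{PD2Z}-$ν$ schedule and use it to derive a central limit theory for a new \textit{tail} Polyak-Ruppert averaging estimator. Finally, they provide a novel time-uniform Gaussian approximation (also known as a \textit{strong invariance principle}) for the partial sum process of Q-learning iterates, which facilitates bootstrap-based inference. All theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, the results establish that \texttt{LD2Z}, and more generally \texttt{PD2Z}-$ν$, achieve a best-of-both-worlds property: they inherit the rapid decay from initialization (characteristic of constant step sizes) while retaining the asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of \texttt{LD2Z} and yields practical guidelines for inference.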
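
The three learning-rate schedules named in the abstract can be compared numerically. The following is a minimal Python sketch, not the authors' code: the function name `schedule`, the default values `eta=0.5`, `alpha=0.75`, `nu=1.0`, and the horizon `n` are illustrative assumptions. Only the three formulas are taken from the abstract.

```python
import numpy as np


def schedule(kind, n, eta=0.5, alpha=0.75, nu=1.0):
    """Return the step sizes for t = 1, ..., n under the named schedule.

    The name and defaults are hypothetical; the formulas follow the abstract.
    """
    t = np.arange(1, n + 1, dtype=float)
    if kind == "constant":
        # eta_t = eta for every t
        return np.full(n, eta)
    if kind == "polynomial":
        # eta_t = eta * t^(-alpha)
        return eta * t ** (-alpha)
    if kind == "pd2z":
        # eta_{t,n} = eta * (1 - t/n)^nu; nu = 1 gives LD2Z
        return eta * (1.0 - t / n) ** nu
    raise ValueError(f"unknown schedule: {kind}")


if __name__ == "__main__":
    n = 10
    for kind in ("constant", "polynomial", "pd2z"):
        print(kind, np.round(schedule(kind, n), 3))
```

Unlike the constant and polynomial schedules, the PD2Z step size depends on the horizon `n`: it starts near `eta` and reaches exactly zero at `t = n`, so `n` has to be fixed before the run starts.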
