Fugu-MT Paper Translation (Abstract): Delightful Gradients Accelerate Corner Escape

Paper abstract: Delightful Gradients Accelerate Corner Escape

arxiv url: http://arxiv.org/abs/2605.11908v1
Date: Tue, 12 May 2026 10:21:36 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.79422
Title: Delightful Gradients Accelerate Corner Escape
Title (reference translation): Delightful Gradients Accelerate Corner Escape
Authors: Jincheng Mei, Ian Osband
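Abstract summary: This work studies \emph{Delightful Policy Gradient} (DG). The authors show, via an exact counterexample, that its tabular mechanism can fail under shared function approximation.
Reference score (independently computed attention score): 6.396365507203636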
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Softmax policy gradient converges at $O(1/t)$, but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We study \emph{Delightful Policy Gradient} (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For $K$-armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, yielding a first-exit escape bound logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an \emph{ally}: its contribution to escape is non-negative. Combining corner instability with a monotonic value improvement identity, we prove that DG converges globally to the optimal policy in both bandits and tabular MDPs at an asymptotic $O(1/t)$ rate. We also show, via an exact counterexample, that this tabular mechanism can fail under shared function approximation. In MNIST contextual bandits with a shared-parameter neural network, DG nevertheless recovers from bad initializations faster than standard policy gradient, suggesting that the counterexample marks a boundary of the theory rather than a practical prohibition.
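Abstract (reference translation): Softmax policy gradient converges at $O(1/t)$, but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. Negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We therefore study \emph{Delightful Policy Gradient} (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For $K$-armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, giving a first-exit escape bound that is logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an \emph{ally}: its contribution to escape is non-negative. Combining corner instability with a monotonic value improvement identity, we show that DG converges globally to the optimal policy in both bandits and tabular MDPs at an asymptotic $O(1/t)$ rate. We also show, via an exact counterexample, that this tabular mechanism can fail under shared function approximation. In MNIST contextual bandits with a shared-parameter neural network, DG nevertheless recovers from bad initializations faster than standard policy gradient, suggesting that the counterexample marks a boundary of the theory rather than a practical prohibition.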
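
The gating idea in the abstract can be illustrated on a three-armed bandit. The sketch below is a minimal illustration, not the authors' implementation: the abstract does not give the gate's functional form, so the logistic gate `sigmoid(advantage * surprisal / temperature)`, the helper names, the reward vector, and the starting policy are all assumptions chosen for illustration. It computes the exact expected update in logit space and compares how the log-ratio between the optimal action and the corner action moves under standard softmax policy gradient and under the assumed gate.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()


def sigmoid(x):
    """Stable logistic function, accurate even when the result is tiny."""
    return np.exp(-np.logaddexp(0.0, -x))


def expected_gradient(theta, rewards, temperature=None):
    """Exact expected update direction for the logits of a K-armed bandit.

    temperature=None: standard softmax policy gradient.
    temperature>0:    each action's term is multiplied by an ASSUMED gate,
                      sigmoid(advantage * surprisal / temperature).
    """
    pi = softmax(theta)
    advantage = rewards - pi @ rewards
    if temperature is None:
        gate = np.ones_like(pi)
    else:
        surprisal = -np.log(pi)
        gate = sigmoid(advantage * surprisal / temperature)
    # sum_a pi(a) * gate(a) * A(a) * grad log pi(a), with grad log pi(a) = e_a - pi
    weights = pi * gate * advantage
    return weights - pi * weights.sum()


def log_ratio_drift(pi, rewards, best, corner, temperature=None):
    """Instantaneous change of log(pi[best] / pi[corner]) under gradient flow."""
    grad = expected_gradient(np.log(pi), rewards, temperature)
    return grad[best] - grad[corner]


# Hypothetical setup: action 0 is optimal, action 1 is the sub-optimal corner,
# action 2 is a harmful action that still holds a little probability mass.
rewards = np.array([1.0, 0.8, 0.0])
pi = np.array([0.001, 0.989, 0.010])

pg_drift = log_ratio_drift(pi, rewards, best=0, corner=1)
dg_drift = log_ratio_drift(pi, rewards, best=0, corner=1, temperature=0.1)
print(f"standard PG drift: {pg_drift:+.2e}")
print(f"gated drift:       {dg_drift:+.2e}")
```

At this starting point the standard drift is negative (the optimal action loses ground to the corner action) while the gated drift is positive. Under the assumed gate, the weight on the rare harmful action is close to `pi ** (|advantage| / temperature)`, which is consistent with the polynomial suppression described in the abstract. Other starting points, rewards, or temperatures may behave differently.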


Related papers