Fugu-MT paper translation (abstract): OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

Paper abstract: OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

arxiv url: http://arxiv.org/abs/2605.13473v1
Date: Wed, 13 May 2026 12:59:26 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:28.056826
Title: OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
Title (reference translation): OSDN: Improving the Delta Rule with Provable Online Preconditioning in Linear Attention
Authors: Chenyu Zhou, Hongpei Li, Yuerou Liu, Jianghao Lin, Dongdong Ge, Yinyu Ye
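Abstract summary: At the 340M-parameter scale, Online Scaled DeltaNet (OSDN) improves JRT-style in-context recall by 32% over DeltaNet. Scaled to 1.3B parameters, it reduces the recall residual ratio by 39%.
Reference score (independently computed attention score): 12.93065958346192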
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) -- demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.
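Abstract (reference translation): Linear attention and state-space models provide constant-memory alternatives to softmax attention, but they often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token with one step of online gradient descent. Its step size, however, depends on a single scalar gate and ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner that is updated online via hypergradient feedback. Importantly, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence lets OSDN strictly preserve DeltaNet's hardware-friendly chunkwise parallel pipeline without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF), which dynamically refreshes stale calibration. Empirically, OSDN shows strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaled to 1.3B parameters, it reduces the recall residual ratio by 39% while maintaining parity on general downstream tasks (e.g., perplexity and LongBench).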
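
The equivalence stated in the abstract, that right-preconditioning by a diagonal matrix equals a per-feature scaling of the write-side key, can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes the inner loss is 0.5 * ||S k - v||^2 and that the preconditioner is a fixed diagonal `d`. The function names and the layout of `S` as a list of rows are hypothetical. The hypergradient update of `d` and APF are not shown, because the abstract does not specify them.

```python
# Dependency-free sketch; all names here are illustrative assumptions.

def delta_step(S, k, v, beta, d=None):
    """One delta-rule write: S <- S - beta * (S k - v) (d * k)^T.

    S is a list of rows (d_v x d_k), k is the key, v is the value, and
    beta is the scalar gate. d is an assumed fixed diagonal preconditioner;
    None means all ones, which gives the plain delta rule.
    """
    if d is None:
        d = [1.0] * len(k)
    # Residual of the assumed inner regression loss 0.5 * ||S k - v||^2.
    err = [sum(s * kj for s, kj in zip(row, k)) - vi for row, vi in zip(S, v)]
    # Per-feature scaling of the write-side key.
    k_write = [dj * kj for dj, kj in zip(d, k)]
    return [[s - beta * e * kw for s, kw in zip(row, k_write)]
            for row, e in zip(S, err)]


def preconditioned_step(S, k, v, beta, d):
    """The same write, computed as the gradient right-multiplied by diag(d)."""
    err = [sum(s * kj for s, kj in zip(row, k)) - vi for row, vi in zip(S, v)]
    grad = [[e * kj for kj in k] for e in err]  # (S k - v) k^T
    return [[s - beta * g * dj for s, g, dj in zip(row, grad_row, d)]
            for row, grad_row in zip(S, grad)]


S0 = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 2.0], [3.0, -1.0]
scaled = delta_step(S0, key, value, beta=0.5, d=[2.0, 0.25])
precond = preconditioned_step(S0, key, value, beta=0.5, d=[2.0, 0.25])
```

Both calls produce the same state, since (S k - v) k^T diag(d) = (S k - v) (diag(d) k)^T. Only the key used for writing changes, which is consistent with the abstract's claim that the chunkwise parallel pipeline is preserved.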

