Fugu-MT Paper Translation (Abstract): Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Paper abstract: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

arxiv url: http://arxiv.org/abs/2605.22791v1
Date: Thu, 21 May 2026 17:44:57 GMT
Status: Translation complete
Last updated in system: 2026-05-22 20:14:18.614353
Title: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Title (reference translation): Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Authors: Ali Hatamizadeh, Yejin Choi, Jan Kautz
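Abstract summary: Linear attention replaces the cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and Kimi Delta Attention (KDA). Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval.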
Reference score (independently computed attention metric): 81.79922329750674
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
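Abstract (reference translation): Linear attention swaps the unbounded cache of softmax attention for a fixed-size recurrent state, which brings sequence mixing down to linear time and decoding down to constant memory. The difficulty lies not only in deciding what to forget but also in editing this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. The active edit, however, still relies on a single scalar gate to control two different things: how much old content is erased on the key side and how much new content is committed on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA: it inherits adaptive forgetting and channel-wise decay while removing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t; it reduces to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay collapses as well. We derive a fast-weight update view, a chunkwise WY algorithm in which channel-wise decay is absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters, trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on the long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.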
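
The abstract describes the decoupled update only in words, so the following is a minimal sketch of one recurrent step under stated assumptions, not the authors' implementation. It assumes a state S of shape (d_k, d_v), a key-side erase gate b_t of length d_k, a value-side write gate w_t of length d_v, and the update S_t = (I - (b_t * k_t) k_t^T) Diag(alpha_t) S_{t-1} + k_t (w_t * v_t)^T, where * is element-wise. This form was chosen only because it reproduces the reductions the abstract states; the paper's exact equations and gate placement may differ. The names `gdn2_step`, `kda_step`, and `decay_and_read` are hypothetical, and `kda_step` is an assumed KDA-style step with a single scalar gate beta, included for comparison.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # recurrent state S with shape (d_k, d_v)


def decay_and_read(S: Matrix, k: Vector, alpha: Vector) -> Tuple[Matrix, Vector]:
    """Apply channel-wise decay to the state, then read it with key k."""
    # Row i of the state (key channel i) is scaled by its own decay alpha[i].
    S_dec = [[alpha[i] * s for s in row] for i, row in enumerate(S)]
    d_v = len(S[0])
    # Current read: k^T S_dec, one value per value channel.
    read = [sum(k[i] * S_dec[i][j] for i in range(len(k))) for j in range(d_v)]
    return S_dec, read


def gdn2_step(S: Matrix, k: Vector, v: Vector,
              alpha: Vector, b: Vector, w: Vector) -> Matrix:
    """One assumed decoupled step: erase gate b (key side), write gate w (value side)."""
    S_dec, read = decay_and_read(S, k, alpha)
    return [
        [S_dec[i][j] - b[i] * k[i] * read[j] + k[i] * w[j] * v[j]
         for j in range(len(v))]
        for i in range(len(k))
    ]


def kda_step(S: Matrix, k: Vector, v: Vector,
             alpha: Vector, beta: float) -> Matrix:
    """Assumed KDA-style step: one scalar beta controls both erase and write."""
    S_dec, read = decay_and_read(S, k, alpha)
    return [
        [S_dec[i][j] + beta * k[i] * (v[j] - read[j]) for j in range(len(v))]
        for i in range(len(k))
    ]
```

In this sketch, passing `b = [beta] * d_k` and `w = [beta] * d_v` makes `gdn2_step` agree with `kda_step`, which mirrors the reduction to KDA stated in the abstract. Setting `b` to ones and `w` to zeros erases along the key direction without committing any new value, a combination that a single shared scalar gate cannot express.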

Related papers