Fugu-MT paper translation (abstract): ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory

Paper abstract: ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory

arxiv url: http://arxiv.org/abs/2606.25156v1
Date: Tue, 23 Jun 2026 20:43:51 GMT
Status: Translation complete
System update date: 2026-06-25 17:05:30.140296
Title: ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory
Title (reference translation): ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compressed Memory
Authors: Habibullah Akbar
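Abstract summary: ATMA is a hybrid convolutional-attention architecture that integrates a novel three-channel attention mechanism. We evaluate ATMA using a 100-run factorial ablation sweep and show that the combined Polar + memory model maintains induction needle-in-a-haystack retrieval accuracy above 90%.
Reference score (independently computed attention score): 0.0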
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern large language models based on softmax scaled-dot-product attention are constrained by their training sequence length: as the key-value sequence grows, softmax probability mass can dilute across a wider distribution, inducing activation shift and long-context performance collapse. Moreover, long-context language modeling faces a structural tension: a sliding-window attention core maintains a bounded local representation and low perplexity but is blind to long-range dependencies, while full-context attention preserves global recall but suffers from out-of-distribution perplexity explosion. To resolve these limitations, we introduce ATMA, a hybrid convolutional-attention architecture that integrates a novel three-channel attention mechanism. ATMA factorizes the attention mixing step into: (1) a count-blind, unit-vector direction channel, (2) a bounded magnitude channel driven by the participation ratio of effective matches over an extreme-value-corrected null sink, and (3) a long-term recurrent compression memory optimized via a gated-delta fast-weights rule. Neither the Polar Attention core nor the recurrent memory is sufficient alone; their combination enables monotonic perplexity reduction and high-fidelity long-range retrieval simultaneously. We evaluate ATMA using a 100-run factorial ablation sweep, demonstrating that the combined Polar + memory model maintains induction needle-in-a-haystack retrieval accuracy above 90% out to 64K tokens (32 times the training length of 2K) while its document perplexity improves monotonically, outperforming softmax-based memory baselines which collapse at extreme context lengths. Code: https://github.com/kreasof-ai/atma
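Abstract (reference translation): Large language models built on softmax scaled-dot-product attention are limited by the sequence length they were trained on: as the key-value sequence grows, the softmax probability mass thins out over a wider distribution, which causes activation shift and a collapse in long-context performance. Long-context language modeling also faces a structural tension. A sliding-window attention core keeps a bounded local representation and low perplexity but is blind to long-range dependencies, whereas full-context attention preserves global recall but suffers from an out-of-distribution perplexity explosion. To resolve these limitations, we introduce ATMA, a hybrid convolutional-attention architecture that incorporates a novel three-channel attention mechanism. ATMA factorizes the attention mixing step into (1) a count-blind, unit-vector direction channel, (2) a bounded magnitude channel driven by the participation ratio of effective matches over an extreme-value-corrected null sink, and (3) a long-term recurrent compression memory optimized with a gated-delta fast-weights rule. Neither the Polar Attention core nor the recurrent memory is sufficient on its own; combining them yields monotonic perplexity reduction and high-fidelity long-range retrieval at the same time. We evaluate ATMA with a 100-run factorial ablation sweep and show that the combined Polar + memory model keeps induction needle-in-a-haystack retrieval accuracy above 90% out to 64K tokens (32 times the 2K training length) while its document perplexity improves monotonically, outperforming softmax-based memory baselines, which collapse at extreme context lengths. Code: https://github.com/kreasof-ai/atma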
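Two of the ingredients named in the abstract, the participation ratio and a gated-delta fast-weights update, can be illustrated with a minimal Python sketch. It uses the textbook participation ratio, (sum w)^2 / sum w^2, and one common formulation of a gated delta rule (decay the memory, then write only the prediction error). Every function name, the matrix layout, and the scalar gates alpha and beta are assumptions made for illustration. The abstract does not give ATMA's actual equations, and the sketch does not model the extreme-value-corrected null sink or the direction channel.

```python
def participation_ratio(weights):
    """Textbook participation ratio: (sum w)^2 / sum w^2.

    Roughly the effective number of entries carrying weight:
    n equal weights give n, a single nonzero weight gives 1.
    (Illustrative only; ATMA's null-sink correction is not modeled.)
    """
    total = sum(weights)
    squares = sum(w * w for w in weights)
    return (total * total) / squares if squares > 0 else 0.0


def gated_delta_step(S, k, v, alpha, beta):
    """One assumed gated-delta fast-weight update.

    S     : d_v x d_k memory matrix as a list of rows (hypothetical layout)
    k, v  : key and value vectors (plain lists)
    alpha : decay gate in [0, 1]; beta: write strength in [0, 1]
    """
    # Gate: decay the existing memory.
    S = [[alpha * s for s in row] for row in S]
    # Read the value the decayed memory currently associates with k.
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in S]
    # Delta rule: write only the error between v and that prediction.
    return [
        [s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
        for row, vi, pi in zip(S, v, pred)
    ]


def read_memory(S, q):
    """Retrieve a value by multiplying the memory with a query vector."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]


if __name__ == "__main__":
    memory = [[0.0, 0.0], [0.0, 0.0]]
    memory = gated_delta_step(memory, [1.0, 0.0], [3.0, 5.0], alpha=1.0, beta=1.0)
    # Writing a new value under the same unit-norm key overwrites the old one.
    memory = gated_delta_step(memory, [1.0, 0.0], [7.0, 9.0], alpha=1.0, beta=1.0)
    print(read_memory(memory, [1.0, 0.0]))
    print(participation_ratio([1.0, 1.0, 1.0, 1.0]))
```

With a unit-norm key and alpha = beta = 1, the update replaces the stored value instead of adding to it, which is the property that distinguishes a delta rule from a purely additive fast-weight memory. Setting alpha below 1 lets older associations fade over time.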
