Fugu-MT Paper Translation (Abstract): Learning When to Attend: Conditional Memory Access for Long-Context LLMs

Paper summary: Learning When to Attend: Conditional Memory Access for Long-Context LLMs

arxiv url: http://arxiv.org/abs/2603.17484v1
Date: Wed, 18 Mar 2026 08:48:18 GMT
Status: Translation complete
System update date: 2026-03-19 18:32:57.587207
Title: Learning When to Attend: Conditional Memory Access for Long-Context LLMs
Title (reference translation): Conditional Memory Access for Long-Context LLMs
Authors: Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, Stefano Soatto
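Abstract summary: Language models struggle to generalize beyond their pretraining context length. This paper proposes L2A (Learning To Attend). L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens.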
Reference score (independently computed attention score): 46.51137149612742
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to $\sim$2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
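Abstract (reference translation): Language models struggle to generalize beyond pretraining context lengths, which limits long-horizon reasoning and retrieval. Continued pretraining on long-context data helps but is expensive because of the quadratic scaling of attention. Most tokens do not need (Global) Attention over the entire sequence and can rely on local context. We therefore propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens, outperforming prior baselines. We also design custom Triton kernels that implement this token-wise conditional attention efficiently on GPUs, achieving up to $\sim$2x improvements in training throughput and time-to-first-token over FlashAttention. In addition, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50%.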
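The abstract describes token-wise conditional attention only at a high level. As a rough illustration of the idea, and not the authors' implementation, the following pure-Python sketch lets each token attend either to the whole causal prefix or to a short local window, depending on a per-token boolean. The names `conditional_attention`, `use_global`, and `window` are hypothetical; the gate is passed in as a fixed list rather than learned, and the paper's gating layer and Triton kernels are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def conditional_attention(queries, keys, values, use_global, window):
    """Causal attention with a per-token choice of context (illustrative only).

    Token t reads the whole prefix 0..t when use_global[t] is True;
    otherwise it reads only the last `window` positions, including t.
    Returns the outputs and the number of key positions read, which
    serves as a crude cost proxy.
    """
    outputs = []
    visited = 0
    for t, q in enumerate(queries):
        start = 0 if use_global[t] else max(0, t - window + 1)
        outputs.append(attend(q, keys[start:t + 1], values[start:t + 1]))
        visited += t + 1 - start
    return outputs, visited


if __name__ == "__main__":
    n, d = 8, 2
    xs = [[math.sin(i + j) for j in range(d)] for i in range(n)]
    gate = [t in (3, 7) for t in range(n)]  # hypothetical gate decisions
    _, cost = conditional_attention(xs, xs, xs, gate, window=2)
    full_cost = n * (n + 1) // 2  # every token reads its full prefix
    print(f"key positions read: {cost} of {full_cost}")
```

In this toy setting the cost of a local token is bounded by `window` while the cost of a global token grows with its position, which is the intuition behind skipping global attention for most tokens. The window size and gate pattern are arbitrary choices for the demo and say nothing about the settings used in the paper.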