Fugu-MT Paper Translation (Abstract): LUCID: Attention with Preconditioned Representations

Paper abstract: LUCID: Attention with Preconditioned Representations

arxiv url: http://arxiv.org/abs/2602.10410v1
Date: Wed, 11 Feb 2026 01:46:32 GMT
Status: Translation complete
Last updated in system: 2026-02-12 21:44:01.367083
Title: LUCID: Attention with Preconditioned Representations
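Title (reference translation): LUCID: Attention via Preconditioned Representations
Authors: Sai Surya Duvvuri, Nirmal Patel, Nilesh Gupta, Inderjit S. Dhillon
Abstract summary: LUCID Attention is an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space. We validate our approach by training ~1 billion parameter language models evaluated on up to 128K tokens.
Reference score (independently computed attention measure): 14.98859684869003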
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens, degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering the softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on the important keys among a large number of keys, with the same computational complexity as standard attention. Additionally, LUCID's preconditioning-based approach to retrieval bypasses the need for low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models evaluated on up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS and LongBench. For instance, LUCID achieves up to 18% improvement in BABILong and 14% improvement in RULER multi-needle performance compared to standard attention.
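The dilution and temperature effects described in the abstract can be illustrated with a small sketch. This is not taken from the paper: it assumes a toy setting with one relevant key whose logit is `gap` and `n_keys - 1` distractors with logit 0, and the names `mass_on_relevant` and `self_gradient`, the value `gap=5.0`, and the temperatures are all hypothetical choices for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mass_on_relevant(n_keys, gap=5.0, temperature=1.0):
    """Probability on one relevant key (logit = gap) among distractors (logit = 0)."""
    logits = [gap] + [0.0] * (n_keys - 1)
    return softmax(logits, temperature)[0]

def self_gradient(n_keys, gap=5.0, temperature=1.0):
    """d p_0 / d z_0 = p_0 * (1 - p_0) / T, one diagonal entry of the softmax Jacobian."""
    p0 = mass_on_relevant(n_keys, gap, temperature)
    return p0 * (1.0 - p0) / temperature

# Toy assumption: the logit gap stays fixed while the context grows.
for n in (1_000, 16_000, 128_000):
    print(n,
          mass_on_relevant(n),                    # shrinks as n grows
          mass_on_relevant(n, temperature=0.25),  # a lower temperature restores focus
          self_gradient(n, temperature=0.25))     # but the gradient becomes tiny
```

In this toy setting the mass on the relevant key equals 1 / (1 + (n - 1) * exp(-gap / T)), so it decays roughly as 1/n at fixed temperature, while a low temperature saturates the softmax and shrinks the factor p_0 * (1 - p_0).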
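Abstract (reference translation): Softmax-based dot-product attention is a foundation of the Transformer architecture and enables powerful capabilities such as in-context learning. As the context grows longer, however, a fundamental limitation of the softmax function appears: it tends to spread probability mass over irrelevant tokens, which lowers performance in long-sequence scenarios. In addition, attempts to sharpen the focus by lowering the softmax temperature reduce learnability because of vanishing gradients. LUCID Attention is an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes the overlap between keys in a Reproducing Kernel Hilbert Space, so that a query can focus accurately on the important keys among a large number of keys at the same computational cost as standard attention. Furthermore, LUCID's preconditioning-based approach to retrieval avoids the need for a low temperature and the learnability problems that come with it. We validate our approach by training ~1 billion parameter language models evaluated on up to 128K tokens. The results show significant improvements on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS and LongBench. For example, LUCID achieves up to an 18% improvement on BABILong and a 14% improvement in RULER multi-needle performance compared to standard attention.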
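The abstract does not give the exact form of the preconditioner, so the following is only a toy sketch of the general idea, not the paper's method. It assumes, purely for illustration, that the preconditioner is the inverse of the Gram matrix G with G[i][j] = exp(k_i · k_j), applied without causal masking or normalization. Under that assumption, querying with a stored key returns exactly that key's value, whereas plain softmax attention mixes in the other values. All names (`solve`, `plain_read`, `preconditioned_read`) and sizes are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

rng = random.Random(0)
n, d = 6, 8  # hypothetical toy sizes
# Random keys rescaled to norm 2 so the exponentials stay moderate.
keys = [[2.0 * x for x in unit([rng.gauss(0, 1) for _ in range(d)])]
        for _ in range(n)]
# One-hot values make it easy to see which key a read retrieves.
values = [[float(i == j) for j in range(n)] for i in range(n)]

# Assumed preconditioner: inverse of the exponentiated key-key Gram matrix.
gram = [[math.exp(dot(ki, kj)) for kj in keys] for ki in keys]
preconditioned_values = solve(gram, values)  # G^{-1} V

def plain_read(q):
    """Standard softmax attention read (non-causal, single query)."""
    w = [math.exp(dot(q, k)) for k in keys]
    s = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, values)) / s for c in range(n)]

def preconditioned_read(q):
    """Unnormalized read exp(q K^T) G^{-1} V under the assumption above."""
    w = [math.exp(dot(q, k)) for k in keys]
    return [sum(wi * v[c] for wi, v in zip(w, preconditioned_values))
            for c in range(n)]

print(plain_read(keys[2]))           # mass leaks to other keys
print(preconditioned_read(keys[2]))  # approximately one-hot at index 2
```

The exact retrieval follows because the query weights for q = k_i equal row i of G, and row i of G times G^{-1} is the i-th unit vector. A practical implementation would also need causal masking and a numerically robust solver, neither of which is shown here.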

Related papers