Fugu-MT Paper Translation (Abstract): Kwai Summary Attention Technical Report

Paper summary: Kwai Summary Attention Technical Report

arxiv url: http://arxiv.org/abs/2604.24432v1
Date: Mon, 27 Apr 2026 12:59:53 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:08.012253
Title: Kwai Summary Attention Technical Report
Title (reference translation): Kwai Summary Report
Authors: Chenglong Chu, Guorui Zhou, Guowang Zhang, Han Li, Hao Peng, Hongtao Cheng, Jian Liang, Jiangxia Cao, Kun Gai, Lingzhi Zhou, Lu Ren, Qi Zhang, Ruiming Tang, Ruitao Wang, Xinchen Luo, Yi Su, Zhiyuan Liang, Ziqi Wang, Boyang Ding, Chengru Song, Dunju Zang, Hui Wang, Jiao Ou, Jiaxin Deng, Jijun Shi, Jinghao Zhang, Junmin Chen, Lejian Ren, Minxuan Lv, Qianqian Wang, Qigen Hu, Shiyao Wang, Siyang Mao, Tao Wang, Xingmei Wang, Zhixin Ling, Ziming Li, Zixing Zhang
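Abstract summary: Long-context ability has become one of the most important directions for next-generation large language models. Standard softmax attention exhibits quadratic time complexity with respect to sequence length. The paper proposes Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical context.
Reference score (independently computed attention level): 69.40814939510126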
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding and reasoning, code agentic intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, causing the training and inference costs of extremely long sequences to escalate rapidly. Existing solutions mitigate this issue through two technical routes: i) Reducing the KV cache per layer, for example through the head-level compression of GQA or the embedding-dimension-level compression of MLA; the KV cache nevertheless remains linearly dependent on the sequence length at a 1:1 ratio. ii) Interleaving with KV-cache-friendly architectures, such as the local attention SWA or the linear kernel GDN; these often involve trade-offs between KV cache size and long-context modeling effectiveness. Besides these two routes, we argue that there exists an intermediate path that is not well explored: maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression through a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache", but rather trades acceptable memory costs for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.
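Abstract (reference translation): Long-context ability has become one of the most important iteration directions for next-generation large language models, particularly in semantic understanding/reasoning, code agentic intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, considerable overhead arises in long-context settings, and the training and inference costs of very long sequences escalate rapidly. Existing solutions mitigate this problem through two technical routes. i) Reducing the KV cache per layer, such as the head-level compression GQA and the embedding-dimension compression MLA, although the KV cache still depends linearly on the sequence length at a 1:1 ratio. ii) Interleaving with KV-cache-friendly architectures such as the local attention SWA and the linear kernel GDN, which often involves trade-offs between the KV cache and long-context modeling effectiveness. A linear relationship between the KV cache and sequence length is maintained, but semantic-level compression is performed at a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache", but instead trades acceptable memory cost for complete, referential, and interpretable retention of long-distance dependencies. The paper therefore proposes Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical context into learnable summary tokens.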
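
The memory argument in the abstract can be made concrete with a rough calculation. The sketch below is not taken from the paper: the function `kv_cache_bytes`, the model shape, and the ratio `k = 16` are all hypothetical, and the $O(n/k)$ case is idealized as exactly one cached entry per `k` tokens, which may not match how KSA actually lays out its cache. It compares a full multi-head cache, a GQA-style cache with fewer KV heads (still one entry per token), and a cache compressed by a ratio `k`.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, ratio_k=1):
    """Approximate KV cache size in bytes for a single sequence.

    ratio_k = 1 models a standard cache with one K/V entry per token.
    ratio_k > 1 models an idealized O(n/k) cache with one entry per k tokens
    (an assumption for illustration, not the paper's exact layout).
    """
    cached_positions = -(-seq_len // ratio_k)  # ceiling division
    # The leading factor 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_elem * cached_positions)


# Hypothetical model shape, chosen only so the numbers are easy to read.
cfg = dict(num_layers=32, head_dim=128)
n = 131072  # sequence length in tokens

mha = kv_cache_bytes(n, num_kv_heads=32, **cfg)                 # full heads
gqa = kv_cache_bytes(n, num_kv_heads=8, **cfg)                  # fewer KV heads
summary = kv_cache_bytes(n, num_kv_heads=8, ratio_k=16, **cfg)  # O(n/k)

for name, size in [("MHA", mha), ("GQA", gqa), ("GQA + k=16", summary)]:
    print(f"{name:>10}: {size / 2**30:.1f} GiB")
```

Under these assumptions, reducing KV heads shrinks the constant factor while the cache still grows by one entry per token, whereas the ratio `k` divides the number of cached positions itself. The two reductions multiply, as the third configuration shows.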

Related papers