Fugu-MT Paper Translation (Abstract): Retrospective Sparse Attention for Efficient Long-Context Generation

Paper summary: Retrospective Sparse Attention for Efficient Long-Context Generation

arxiv url: http://arxiv.org/abs/2508.09001v1
Date: Tue, 12 Aug 2025 15:11:47 GMT
Status: translation complete
Last updated in system: 2025-08-13 21:07:34.473946
Title: Retrospective Sparse Attention for Efficient Long-Context Generation
Title (reference translation): Retrospective Sparse Attention for Efficient Long-Context Generation
Authors: Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim
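Abstract summary: RetroAttention retrospectively updates past attention outputs using newly arrived KV entries from subsequent decoding steps. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Experiments show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods.
Reference score (independently computed attention metric): 5.562294018150909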
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%.
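The linear growth of the KV cache mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration chosen for the example and does not come from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold keys and values for every cached token.

    The leading factor of 2 accounts for storing one key and one value
    vector per token, per KV head, per layer.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)


# Hypothetical model shape, assumed for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (4096, 32768, 131072):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB")
```

Under these assumed dimensions each token costs 128 KiB, so the cache takes 0.5 GiB at 4,096 tokens, 4 GiB at 32,768 tokens, and 16 GiB at 131,072 tokens. Every decoding step that attends over the full cache has to read all of it, which is why compression methods load only a subset of entries.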
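Abstract (reference translation): Large language models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the key-value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at every decoding step. Recent KV cache compression methods identify and load important tokens, but they focus mainly on input contexts and cannot address the cumulative attention errors that arise during long decoding. This paper introduces RetroAttention, a new KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention lets past queries efficiently access more relevant context while keeping latency overhead minimal. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%.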
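The abstract does not spell out how a past attention output is revised, so the following is only a minimal sketch of one standard way to do it, not the authors' implementation. It assumes the cached output is stored together with its softmax statistics (maximum score and normalizer) and is merged with attention over later KV entries using the usual log-sum-exp rescaling. The function names `partial_attention` and `merge` are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over a subset of KV entries.

    Returns (output, max_score, normalizer) so that the result can be
    merged with attention over other entries later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / l
           for d in range(dim)]
    return out, m, l


def merge(state_a, state_b):
    """Combine two partial attention states into attention over their union."""
    out_a, m_a, l_a = state_a
    out_b, m_b, l_b = state_b
    m = max(m_a, m_b)
    # Rescale each normalizer to the shared maximum before mixing.
    w_a = l_a * math.exp(m_a - m)
    w_b = l_b * math.exp(m_b - m)
    l = w_a + w_b
    out = [(w_a * a + w_b * b) / l for a, b in zip(out_a, out_b)]
    return out, m, l


# Toy data, assumed for illustration only.
q = [0.5, -1.0, 0.25, 2.0]
keys = [[1.0, 0.0, 0.5, 0.2], [0.3, -0.7, 1.1, 0.9],
        [-0.4, 0.8, 0.0, 1.5], [2.0, 0.1, -0.3, 0.6]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 4.0]]

cached = partial_attention(q, keys[:2], values[:2])   # output computed earlier
arrived = partial_attention(q, keys[2:], values[2:])  # entries seen later
revised, _, _ = merge(cached, arrived)
full, _, _ = partial_attention(q, keys, values)
print(all(math.isclose(r, f) for r, f in zip(revised, full)))
```

In this sketch the revised output matches attention computed over all four entries at once, so a past query can benefit from entries it did not originally load without recomputing from scratch. How RetroAttention selects entries and organizes its output cache is not described in the abstract and is left out here.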
