Fugu-MT paper translation (abstract): NOSA: Native and Offloadable Sparse Attention

Paper overview: NOSA: Native and Offloadable Sparse Attention

arxiv url: http://arxiv.org/abs/2510.13602v1
Date: Wed, 15 Oct 2025 14:33:16 GMT
Status: translation complete
Last updated in system: 2025-10-16 20:13:28.711567
Title: NOSA: Native and Offloadable Sparse Attention
Title (reference translation): NOSA: Native, Offloadable Sparse Attention
Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu
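Abstract summary: We propose NOSA, a trainable sparse attention framework designed to support KV cache offloading. We show that NOSA improves decoding throughput by up to 2.3x while preserving near-lossless performance.
Reference score (independently computed attention metric): 27.551376861663556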
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).
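Abstract (reference translation): Trainable sparse attention has emerged as a promising way to address the decoding efficiency bottleneck of LLMs in long-context processing, substantially reducing memory accesses with minimal impact on task performance. However, existing sparse attention methods leave one limitation unresolved: the size of the key-value (KV) cache is not reduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. This paper shows that trainable sparse attention exhibits strong locality in token selection across adjacent decoding steps, which makes KV cache offloading possible without changing the underlying attention computation. That inherent locality is still insufficient for efficient offloading, because transferring the selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, the authors present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, reducing KV transfers while keeping the same attention computation used during training. They pretrain a 1B-parameter model with NOSA and run extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput over the vanilla trainable sparse attention baseline (InfLLM-V2).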
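The abstract does not spell out the selection rule, so the following is only a toy sketch of why a query-agnostic component could reduce CPU-to-GPU KV transfers. Everything in it is an assumption made for illustration: the block counts, the random scores, the rule that only the current selection stays resident on the GPU, and the names `top_k`, `count_transfers`, and `simulate`. Query-aware scores are redrawn independently at every step, which is deliberately pessimistic; the paper reports that real selections already show locality.

```python
import random


def top_k(scores, k):
    """Return the indices of the k largest scores as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def count_transfers(selections):
    """Count KV blocks fetched from CPU memory over all decoding steps.

    Assumption: only the blocks selected at the previous step are still
    resident on the GPU, so every newly selected block costs one transfer.
    """
    transfers = 0
    resident = set()
    for selected in selections:
        transfers += len(selected - resident)
        resident = selected
    return transfers


def simulate(num_blocks=256, budget=32, agnostic_k=0, steps=64, seed=0):
    """Toy decoding loop with a fixed selection budget per step.

    agnostic_k blocks are chosen from scores that do not depend on the
    query, so they stay the same at every step; the remaining
    budget - agnostic_k blocks are chosen from per-step (query-aware) scores.
    """
    rng = random.Random(seed)
    # Hypothetical query-agnostic importance: one fixed score per block.
    agnostic_scores = [rng.random() for _ in range(num_blocks)]
    agnostic = top_k(agnostic_scores, agnostic_k)

    selections = []
    for _ in range(steps):
        # Hypothetical query-aware scores: redrawn for every new query.
        query_scores = [rng.random() for _ in range(num_blocks)]
        for i in agnostic:
            query_scores[i] = -1.0  # already selected, never pick twice
        aware = top_k(query_scores, budget - agnostic_k)
        selections.append(agnostic | aware)
    return count_transfers(selections)


if __name__ == "__main__":
    for k in (0, 16, 32):
        print(f"agnostic_k={k:2d} transfers={simulate(agnostic_k=k)}")
```

In this toy model the attention budget per step is unchanged; only the share of the selection that is stable across steps grows, and with it the number of blocks that need no transfer. It says nothing about accuracy, which the paper evaluates separately.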

Related papers