Fugu-MT Paper Translation (Abstract): ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Paper abstract: ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

arxiv url: http://arxiv.org/abs/2605.23081v1
Date: Thu, 21 May 2026 22:28:27 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.124103
Title: ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Title (reference translation): ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Authors: Joe Sharratt
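Abstract summary: We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. Across long-context benchmarks and model families, computing only 5% of query-key blocks in FP16 lets ThriftAttention recover on average 89.1% of the FP4-to-FP16 performance gap.
Reference score (independently computed prominence): 0.12691047660244334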
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
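Abstract (reference translation): Efficient attention algorithms are important for reducing the quadratic cost of attention in long-context workloads. Earlier work uses block-scaled quantisation on Blackwell GPUs to move the attention computation to 4-bit precision and speed up inference. These techniques, however, cause substantial quality degradation in long-context settings. We show that the effect of quantisation error on the output is highly non-uniform and grows with the importance of each query-key interaction, so that the functionally relevant error is concentrated in a small number of attention blocks containing the most important tokens. We propose ThriftAttention, a low-bit attention variant that achieves near-FP16 long-context quality at FP4 inference efficiency. The approach has two stages. First, a heuristic quickly selects a small number of important query-key block pairs to be handled in FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, and the two paths are merged into a single output via online softmax. Across long-context benchmarks and model families, we show that computing only 5% of query-key blocks in FP16 recovers on average 89.1% of the FP4-to-FP16 performance gap. The advantage of ThriftAttention grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.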
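
The two-stage procedure in the abstract (select a few blocks for high precision, compute the rest at low precision, merge both paths with online softmax) can be illustrated with a minimal pure-Python sketch for a single query. This is not the authors' implementation. The block selector `select_blocks` is a hypothetical stand-in for the paper's heuristic, `fake_quantise` is a crude uniform rounding rather than real block-scaled FP4, only the query-key scores pass through the low-precision path, and every function name here is an assumption made for illustration.

```python
import math
import random


def fake_quantise(x, levels=8, max_abs=4.0):
    """Crude stand-in for low-bit rounding: clip x, then snap it to a coarse uniform grid."""
    step = max_abs / levels
    clipped = max(-max_abs, min(max_abs, x))
    return round(clipped / step) * step


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_scores(q, keys, low_precision):
    """Query-key scores for one key block, optionally computed on coarsely rounded inputs."""
    if low_precision:
        q = [fake_quantise(x) for x in q]
        keys = [[fake_quantise(x) for x in k] for k in keys]
    return [dot(q, k) for k in keys]


def merge_block(state, scores, values):
    """Online-softmax update: fold one block into (running max, denominator, weighted sum)."""
    m_old, l_old, acc_old = state
    m_new = max(m_old, max(scores))
    scale = math.exp(m_old - m_new)  # rescales everything accumulated so far
    l_new = l_old * scale
    acc_new = [a * scale for a in acc_old]
    for s, v in zip(scores, values):
        w = math.exp(s - m_new)
        l_new += w
        acc_new = [a + w * x for a, x in zip(acc_new, v)]
    return m_new, l_new, acc_new


def select_blocks(q, key_blocks, budget):
    """Hypothetical selector (not the paper's): rank blocks by q . mean(keys), keep the top ones."""
    def mean_key(keys):
        return [sum(col) / len(keys) for col in zip(*keys)]

    ranked = sorted(
        range(len(key_blocks)),
        key=lambda i: dot(q, mean_key(key_blocks[i])),
        reverse=True,
    )
    return set(ranked[:budget])


def mixed_attention(q, key_blocks, value_blocks, high_precision_ids):
    """Attention output for one query; blocks outside high_precision_ids use rounded scores."""
    state = (float("-inf"), 0.0, [0.0] * len(value_blocks[0][0]))
    for i, (keys, values) in enumerate(zip(key_blocks, value_blocks)):
        scores = block_scores(q, keys, low_precision=i not in high_precision_ids)
        state = merge_block(state, scores, values)
    _, denom, acc = state
    return [a / denom for a in acc]


def reference_attention(q, key_blocks, value_blocks):
    """Plain full-precision softmax attention over all keys at once."""
    keys = [k for block in key_blocks for k in block]
    values = [v for block in value_blocks for v in block]
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def make_blocks(n_blocks, block_size, dim, seed=0):
    """Random test data shaped as blocks x rows x features."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(block_size)]
        for _ in range(n_blocks)
    ]
```

When every block is marked high precision, `mixed_attention` agrees with `reference_attention` up to floating-point rounding, which is the property that lets the two precision paths share one softmax normalisation. Passing `select_blocks(q, key_blocks, budget)` as `high_precision_ids` gives the mixed case.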


Related papers