Fugu-MT Paper Translation (Abstract): FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

Paper overview: FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

arxiv url: http://arxiv.org/abs/2603.06199v1
Date: Fri, 06 Mar 2026 12:12:46 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.674351
Title: FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
Authors: Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, Ran He
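Abstract summary: FlashPrefill is a framework that enables ultra-fast prefilling via instantaneous pattern discovery and thresholding. It achieves an unprecedented 27.78x speedup on 256K sequences. Unlike existing methods, which lose efficiency on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length.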
Reference score (independently computed notability): 43.057651076580264
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
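The abstract does not spell out the thresholding rule, so the following is only a minimal sketch of the general idea of choosing attention blocks by a threshold rather than by sorting. It assumes a hypothetical rule: for each query block, keep the causal key blocks whose score is at least `alpha` times the row maximum. The function names, the `alpha` parameter, and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def select_blocks(block_scores, alpha=0.1):
    """Keep, for each query block i, the causal key blocks j <= i whose
    score is at least alpha times the row maximum.

    ASSUMPTION: this max-relative cutoff is a stand-in for the paper's
    dynamic thresholding, whose exact form the abstract does not give.
    It needs one pass for the maximum and one for the comparison, so no
    sorting or cumulative sum over scores is required.
    """
    kept = []
    for i, row in enumerate(block_scores):
        causal = row[: i + 1]          # causal mask at block granularity
        cutoff = alpha * max(causal)   # per-row threshold
        kept.append([j for j, s in enumerate(causal) if s >= cutoff])
    return kept


def sparsity(kept):
    """Fraction of causal blocks that were skipped."""
    total = sum(i + 1 for i in range(len(kept)))
    used = sum(len(row) for row in kept)
    return 1.0 - used / total


# Toy, made-up block scores (nonnegative, e.g. pooled attention weights).
scores = [
    [1.00, 0.00, 0.00, 0.00],
    [0.70, 0.30, 0.00, 0.00],
    [0.60, 0.02, 0.38, 0.00],
    [0.50, 0.01, 0.03, 0.46],
]
kept = select_blocks(scores, alpha=0.1)
print(kept)            # kept key-block indices per query block
print(sparsity(kept))  # fraction of causal blocks skipped
```

In this toy input the surviving blocks are column 0 in every row and the diagonal, which loosely mirrors the vertical and slash patterns the abstract mentions. The small long-tail scores (0.01 to 0.03) fall below the cutoff and are dropped.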

Related papers