Fugu-MT Paper Translation (Abstract): FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

Paper summary: FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

arxiv url: http://arxiv.org/abs/2602.05305v1
Date: Thu, 05 Feb 2026 04:57:21 GMT
Status: Translation complete
Last updated in system: 2026-02-06 18:49:08.764784
Title: FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
Title (reference translation): FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
Authors: Zhuokun Chen, Jianfei Cai, Bohan Zhuang
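Abstract summary: FlashBlock is a cached block-external attention mechanism that reuses stable attention outputs, reducing attention computation and KV cache access without modifying the diffusion process. Experiments on diffusion language models and video generation show up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality.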
Reference score (independently computed interest level): 51.1618564189244
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
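Abstract (reference translation): Generating long-form content, such as minute-long videos and extended texts, is becoming increasingly important for modern generative models. Block diffusion improves inference efficiency through KV caching and block-wise causal inference, and it has been widely adopted in diffusion language models and video generation. In long-context settings, however, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. The authors identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Their analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, they propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. FlashBlock is also orthogonal to sparse attention and can be combined with it as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation show up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.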
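The abstract does not say how a cached block-external attention output is recombined with freshly computed block-internal attention, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the two parts are merged with the standard log-sum-exp identity for splitting softmax attention over disjoint key sets. The function names `partial_attention` and `merge`, and all toy data, are hypothetical. The merge is exact when the query is unchanged; reusing the cached external part after the query has moved is an approximation, which is the kind of reuse the abstract describes.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over one set of keys/values.

    Returns (output, lse), where lse is the log-sum-exp of the scaled
    scores. The lse is what allows two partial results to be merged.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results into attention over the union of keys."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]

# Toy data (hypothetical): "external" = KV cache of earlier blocks,
# "internal" = tokens of the block currently being denoised.
ext_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ext_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
int_k = [[0.5, -0.5], [-1.0, 0.5]]
int_v = [[0.0, 1.0], [2.0, 0.0]]
q = [0.3, -0.2]

# Diffusion step t: compute the external part once and cache it.
ext_out, ext_lse = partial_attention(q, ext_k, ext_v)
int_out, int_lse = partial_attention(q, int_k, int_v)
merged = merge(ext_out, ext_lse, int_out, int_lse)
full, _ = partial_attention(q, ext_k + int_k, ext_v + int_v)
exact_gap = max(abs(a - b) for a, b in zip(merged, full))

# Diffusion step t+1: the block's tokens (and hence the query) moved a bit.
# Only the internal part is recomputed; the cached external part is reused.
q2 = [0.31, -0.19]
int_k2 = [[0.52, -0.48], [-0.98, 0.5]]
int_out2, int_lse2 = partial_attention(q2, int_k2, int_v)
approx = merge(ext_out, ext_lse, int_out2, int_lse2)
exact2, _ = partial_attention(q2, ext_k + int_k2, ext_v + int_v)
reuse_gap = max(abs(a - b) for a, b in zip(approx, exact2))

print("gap with same query:", exact_gap)
print("gap when reusing cached external output:", reuse_gap)
```

In this sketch the saving comes from skipping `partial_attention` over the external keys at step t+1, which is the part that grows with context length. How often the cache should be refreshed is not addressed here.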
