Fugu-MT Paper Translation (Abstract): LongFlow: Efficient KV Cache Compression for Reasoning M

Paper overview: LongFlow: Efficient KV Cache Compression for Reasoning M

arxiv url: http://arxiv.org/abs/2603.11504v1
Date: Thu, 12 Mar 2026 03:46:35 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.848791
Title: LongFlow: Efficient KV Cache Compression for Reasoning M
Title (reference translation): LongFlow: Efficient KV Cache Compression for Reasoning M
Authors: Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang
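Abstract summary: LongFlow is a KV cache compression method with an efficient importance estimation metric. It achieves up to an 11.8x throughput improvement with 80% KV cache compression.
Reference score (independently computed attention score): 40.00703310813227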
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8 times throughput improvement with 80% KV cache compression with minimal impact on model accuracy.
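Abstract (reference translation): Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 perform strongly on complex tasks such as mathematical reasoning and code generation. This gain, however, comes with much longer output sequences, which significantly raises deployment costs. In particular, long outputs need large KV caches, causing high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are not effective in the long-output setting of reasoning models. In addition, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, the authors propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design adds negligible computational overhead and needs no auxiliary storage. The authors also develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8x throughput improvement with 80% KV cache compression and minimal impact on model accuracy.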
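The eviction idea in the abstract (score cached tokens with a quantity already computed during attention for the current query, then drop the lowest-scoring entries) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not specify the actual LongFlow metric, so this sketch substitutes the plain softmax attention weights of the current query as the importance score, and the function names `attention_weights`, `evict_lowest`, and `decode_step` are hypothetical. It is pure Python and makes no attempt to reproduce the fused FlashAttention kernel.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all cached keys."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def evict_lowest(keys, values, scores, keep_ratio):
    """Keep the highest-scoring fraction of cache entries, preserving token order."""
    budget = max(1, int(round(keep_ratio * len(keys))))
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:budget])
    return [keys[i] for i in kept], [values[i] for i in kept], kept

def decode_step(query, keys, values, keep_ratio=0.2):
    """One decoding step: attend, then reuse the same weights as eviction scores.

    ASSUMPTION: attention weights stand in for the paper's importance metric,
    which the abstract does not define. keep_ratio=0.2 mirrors the 80%
    compression figure quoted in the abstract.
    """
    weights = attention_weights(query, keys)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    new_keys, new_values, kept = evict_lowest(keys, values, weights, keep_ratio)
    return output, new_keys, new_values, kept
```

With ten cached tokens and `keep_ratio=0.2`, two entries survive the step. Because the scores are a by-product of the attention computation itself, the sketch needs no extra pass over the cache and no stored score history, which is the property the abstract emphasizes.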

Related papers