Fugu-MT Paper Translation (Abstract): Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference

Paper overview: Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference

arXiv URL: http://arxiv.org/abs/2509.04467v1
Date: Fri, 29 Aug 2025 02:29:52 GMT
Status: Translation complete
Last updated in system: 2025-09-08 14:27:25.301839
Title: Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference
Authors: Hao Zhang, Mengsi Lyu, Yulong Ao, Yonghua Lin
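Abstract summary: Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. We propose a novel pruning method for PD disaggregation inference that enables more precise and efficient block and KV Cache pruning.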
Reference score (independently computed attention metric): 5.127648076034455
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands. However, existing methods often ignore the characteristics of prefill-decode (PD) disaggregation in practice. In this paper, we propose a novel pruning method for PD disaggregation inference, enabling more precise and efficient block and KV Cache pruning. Our approach constructs pruning and distillation sets to perform iterative block removal independently for the prefill and decode stages, obtaining better pruning solutions. Moreover, we introduce a token-aware cache pruning mechanism that retains all KV Cache in the prefill stage but selectively reuses entries for the first and last token sequences in selected layers during decode, reducing communication costs with minimal overhead. Extensive experiments demonstrate that our approach consistently achieves strong performance in both PD disaggregation and PD unified settings without disaggregation. Under the default settings, our method achieves a 20.56% inference speedup and a 4.95 times reduction in data transmission bandwidth consumption.
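
The token-aware cache pruning idea in the abstract can be illustrated with a minimal sketch. It assumes that "first and last token sequences" means a fixed number of leading and trailing token positions, and that the set of selected layers is given in advance; the abstract specifies neither. All names here (select_kv_positions, plan_kv_transfer, transfer_reduction, n_first, n_last, pruned_layers) are hypothetical and are not taken from the paper. The sketch only plans which KV entries would be sent from the prefill side to the decode side and estimates the resulting reduction in transferred entries.

```python
from typing import Dict, List, Set


def select_kv_positions(seq_len: int, n_first: int, n_last: int) -> List[int]:
    """Return the first n_first and last n_last token positions, without duplicates."""
    head = range(min(n_first, seq_len))
    tail = range(max(seq_len - n_last, 0), seq_len)
    return sorted(set(head) | set(tail))


def plan_kv_transfer(
    num_layers: int,
    seq_len: int,
    pruned_layers: Set[int],
    n_first: int,
    n_last: int,
) -> Dict[int, List[int]]:
    """Map each layer to the token positions whose KV entries are sent to decode.

    Layers in pruned_layers (an assumed, pre-chosen set) send only the first
    and last positions; every other layer sends its full KV cache.
    """
    kept = select_kv_positions(seq_len, n_first, n_last)
    full = list(range(seq_len))
    return {
        layer: (kept if layer in pruned_layers else full)
        for layer in range(num_layers)
    }


def transfer_reduction(plan: Dict[int, List[int]], seq_len: int) -> float:
    """Ratio of KV entries in a full transfer to entries in the planned transfer."""
    sent = sum(len(positions) for positions in plan.values())
    if sent == 0:
        raise ValueError("plan transfers no KV entries")
    return (len(plan) * seq_len) / sent


# Toy example: 4 layers, 10 tokens, layers 1 and 2 pruned to 2 leading + 3 trailing tokens.
plan = plan_kv_transfer(num_layers=4, seq_len=10, pruned_layers={1, 2}, n_first=2, n_last=3)
print(plan[1])                                   # [0, 1, 7, 8, 9]
print(f"{transfer_reduction(plan, 10):.2f}x")    # 1.33x
```

The toy numbers are illustrative only and are unrelated to the 4.95 times bandwidth reduction reported in the abstract, which depends on the paper's own layer selection and settings.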
