Fugu-MT paper translation (abstract): Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications

Paper abstract: Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications

arxiv url: http://arxiv.org/abs/2601.08833v1
Date: Fri, 14 Nov 2025 06:42:27 GMT
Status: Translation complete
Last updated in system: 2026-01-25 16:54:51.649549
Title: Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications
Title (reference translation): Revisiting Disaggregated Large Language Model Serving with Performance and Energy in Mind
Authors: Jiaxi Li, Yue Zhu, Eun Kyung Lee, Klara Nahrstedt
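Abstract summary: We re-evaluate prefill-decode disaggregation under different KV transfer mediums and optimization strategies. Our results show that performance benefits from prefill-decode disaggregation are not guaranteed and depend on the request load and the KV transfer medium.
Reference score (independently computed attention score): 5.28675741509738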
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Different from traditional Large Language Model (LLM) serving that colocates the prefill and decode stages on the same GPU, disaggregated serving dedicates distinct GPUs to prefill and decode workload. Once the prefill GPU completes its task, the KV cache must be transferred to the decode GPU. While existing works have proposed various KV cache transfer paths across different memory and storage tiers, there remains a lack of systematic benchmarking that compares their performance and energy efficiency. Meanwhile, although optimization techniques such as KV cache reuse and frequency scaling have been utilized for disaggregated serving, their performance and energy implications have not been rigorously benchmarked. In this paper, we fill this research gap by re-evaluating prefill-decode disaggregation under different KV transfer mediums and optimization strategies. Specifically, we include a new colocated serving baseline and evaluate disaggregated setups under different KV cache transfer paths. Through GPU profiling using dynamic voltage and frequency scaling (DVFS), we identify and compare the performance-energy Pareto frontiers across all setups to evaluate the potential energy savings enabled by disaggregation. Our results show that performance benefits from prefill-decode disaggregation are not guaranteed and depend on the request load and KV transfer mediums. In addition, stage-wise independent frequency scaling enabled by disaggregation does not lead to energy saving due to inherently higher energy consumption of disaggregated serving.
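Abstract (reference translation): Unlike traditional Large Language Model (LLM) serving, which colocates the prefill and decode stages on the same GPU, disaggregated serving assigns separate GPUs to the prefill and decode workloads. Once the prefill GPU completes its task, the KV cache must be transferred to the decode GPU. Existing work has proposed a variety of KV cache transfer paths across different memory and storage tiers, but a systematic benchmark comparing their performance and energy efficiency is still missing. Likewise, optimization techniques such as KV cache reuse and frequency scaling have been used in disaggregated serving, but their performance and energy implications have not been rigorously benchmarked. This paper fills that gap by re-evaluating prefill-decode disaggregation under different KV transfer mediums and optimization strategies. Specifically, it includes a new colocated serving baseline and evaluates disaggregated setups under different KV cache transfer paths. Through GPU profiling with dynamic voltage and frequency scaling (DVFS), it identifies and compares the performance-energy Pareto frontiers across all setups to assess the potential energy savings enabled by disaggregation. The results show that performance benefits from prefill-decode disaggregation are not guaranteed and depend on the request load and the KV transfer medium. In addition, the stage-wise independent frequency scaling enabled by disaggregation does not lead to energy savings, because disaggregated serving inherently consumes more energy.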
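
The abstract refers to performance-energy Pareto frontiers obtained by profiling GPUs at different DVFS settings. As a minimal sketch of what identifying such a frontier involves (not the authors' tooling), the Python below keeps only the configurations that no other configuration beats on both latency and energy. The function name `pareto_frontier`, the `(latency_s, energy_j)` tuple layout, and the sample numbers are all assumptions made for illustration; none of them come from the paper.

```python
from typing import List, Tuple

# Hypothetical layout: (latency in seconds, energy in joules).
# Lower is better on both axes.
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated when minimizing both axes."""
    frontier: List[Point] = []
    best_energy = float("inf")
    # After sorting by latency (ties broken by energy), a point is on the
    # frontier only if its energy is strictly lower than every earlier point's.
    for latency, energy in sorted(points):
        if energy < best_energy:
            frontier.append((latency, energy))
            best_energy = energy
    return frontier


# Made-up measurements, one per hypothetical GPU frequency setting.
samples: List[Point] = [
    (1.0, 500.0),
    (1.2, 420.0),
    (1.5, 450.0),  # dominated by (1.2, 420.0)
    (2.0, 400.0),
    (2.0, 430.0),  # dominated by (2.0, 400.0)
]

frontier = pareto_frontier(samples)
print(frontier)
```

Comparing two serving setups in this spirit would mean computing one frontier per setup and checking whether one frontier lies entirely below and to the left of the other.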


Related papers