Fugu-MT Paper Translation (Abstract): Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

Paper summary: Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

arxiv url: http://arxiv.org/abs/2604.09611v1
Date: Thu, 12 Mar 2026 10:10:37 GMT
Status: Translation complete
System update date: 2026-04-19 19:09:11.552467
Title: Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows
Authors: Md. Monzurul Amin Ifath, Israat Haque
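Abstract summary: Large language models (LLMs) are increasingly used in applications that form multi-request workflows. These workflows amplify latency and energy demand during inference. This paper presents a systematic characterization of performance-energy trade-offs in multi-request inference.
Reference score (attention score computed independently by this site): 0.8250374560598494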
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in applications forming multi-request workflows like document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored. To address these gaps, this paper presents the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. We develop four representative workloads capturing sequential, interactive, agentic, and composite patterns common in modern deployments. Using an NVIDIA A100 testbed with state-of-the-art serving systems (vLLM and Parrot), we analyze how key energy knobs affect latency, throughput, and component-level energy use. Our findings reveal batch size as the most impactful lever, though benefits are workload dependent. While optimal batching benefits workloads with large shared prompts, it is ineffective for sequential summarization and only partially effective for multi-agent coding. GPU power capping provides modest but predictable savings, while output length induces linear energy scaling with limited efficiency gains. We further show that engine-level optimizations in vLLM maintain higher GPU utilization and efficiency, especially for decode-heavy workloads, while Parrot's workflow-aware scheduling achieves lower energy consumption under strict power constraints. These findings offer actionable guidelines for developers and system operators designing performance- and energy-aware LLM serving systems in emerging multi-request workflows.
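Illustrative note: the abstract reports energy use and linear energy scaling with output length but does not say how energy is computed. The sketch below is one common way to turn sampled power readings into energy and energy per output token, using trapezoidal integration. The function names, the sampling interval, and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule.

    Assumption: samples come from some power meter or GPU telemetry source;
    the paper's actual measurement pipeline is not described in the abstract.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    if len(timestamps_s) < 2:
        return 0.0  # a single sample spans no time, so no energy can be attributed

    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of one trapezoid: average power over the interval times its duration.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_per_token(energy_j: float, output_tokens: int) -> float:
    """Normalize energy by generated tokens to compare runs with different output lengths."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return energy_j / output_tokens


if __name__ == "__main__":
    # Hypothetical one-second samples for a single workflow run.
    t = [0.0, 1.0, 2.0, 3.0]
    p = [200.0, 300.0, 300.0, 200.0]
    e = energy_joules(t, p)
    print(e)                         # 800.0
    print(joules_per_token(e, 400))  # 2.0
```

Normalizing by output tokens is a design choice for this sketch: if energy grows roughly linearly with output length, joules per token stays roughly flat, which is consistent with the abstract's remark that longer outputs yield limited efficiency gains.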
