Fugu-MT Paper Translation (Abstract): LLM Serving Optimization with Variable Prefill and Decode Lengths

Paper summary: LLM Serving Optimization with Variable Prefill and Decode Lengths

arxiv url: http://arxiv.org/abs/2508.06133v1
Date: Fri, 08 Aug 2025 08:54:21 GMT
Status: Translation complete
Last updated in system: 2025-08-11 20:39:06.155534
Title: LLM Serving Optimization with Variable Prefill and Decode Lengths
Title (reference translation): LLM serving optimization with variable prefill and decode lengths
Authors: Meixuan Wang, Yinyu Ye, Zijie Zhou
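Abstract summary: This work studies the problem of serving LLM (Large Language Model) requests in which each request has heterogeneous prefill and decode lengths. The problem is shown to be NP-hard due to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. The paper proposes a novel algorithm, based on a new selection metric, that efficiently forms batches over time.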
Reference score (independently computed attention score): 2.666596421430287
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the problem of serving LLM (Large Language Model) requests where each request has heterogeneous prefill and decode lengths. In LLM serving, the prefill length corresponds to the input prompt length, which determines the initial memory usage in the KV cache. The decode length refers to the number of output tokens generated sequentially, with each additional token increasing the KV cache memory usage by one unit. Given a set of n requests, our goal is to schedule and process them to minimize the total completion time. We show that this problem is NP-hard due to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. We then analyze commonly used scheduling strategies in practice, such as First-Come-First-Serve (FCFS) and Shortest-First (SF), and prove that their competitive ratios scale up sublinearly with the memory limit, a significant drawback in real-world settings where memory demand is large. To address this, we propose a novel algorithm based on a new selection metric that efficiently forms batches over time. We prove that this algorithm achieves a constant competitive ratio. Finally, we develop and evaluate a few algorithm variants inspired by this approach, including dynamic programming variants, local search methods, and an LP-based scheduler, demonstrating through comprehensive simulations that they outperform standard baselines while maintaining computational efficiency.
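Abstract (reference translation): We study the problem of serving LLM (Large Language Model) requests in which each request has heterogeneous prefill and decode lengths. In LLM serving, the prefill length corresponds to the input prompt length and determines the initial memory usage in the KV cache. The decode length is the number of output tokens generated sequentially, and each additional token increases KV cache memory usage by one unit. Given a set of n requests, the goal is to schedule and process them so as to minimize the total completion time. The problem is shown to be NP-hard owing to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. Next, commonly used scheduling strategies such as First-Come-First-Serve (FCFS) and Shortest-First (SF) are analyzed, and their competitive ratios are proved to scale up sublinearly with the memory limit, a significant drawback in real-world settings where memory demand is large. To address this, a novel algorithm is proposed, based on a new selection metric that efficiently forms batches over time, and it is proved to achieve a constant competitive ratio. Finally, several algorithm variants inspired by this approach are developed and evaluated, including dynamic programming variants, local search methods, and an LP-based scheduler, and simulations show that they outperform standard baselines while maintaining computational efficiency.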
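
The memory model in the abstract (a request starts at its prefill length in the KV cache and grows by one unit per decoded token) can be made concrete with a small simulation. The sketch below is not the paper's algorithm: it only compares the two baselines the abstract names, FCFS and Shortest-First, under assumptions of our own. Those assumptions are that all requests arrive at time 0, every running request decodes exactly one token per time step, memory is released the moment a request finishes, and a request is admitted only if the projected peak memory never exceeds the limit. The names `Request`, `peak_memory`, and `total_completion_time` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prefill: int  # prompt length: KV-cache units held at admission
    decode: int   # output tokens (assumed >= 1); each adds one KV-cache unit


def peak_memory(running):
    """Projected peak KV-cache usage for a list of (request, tokens_done)."""
    horizon = max((r.decode - done for r, done in running), default=0)
    peak = 0
    for step in range(1, horizon + 1):
        # A request holds memory only while it is still producing tokens.
        usage = sum(r.prefill + done + step
                    for r, done in running if done + step <= r.decode)
        peak = max(peak, usage)
    return peak


def total_completion_time(requests, memory_limit, order_key=None):
    """Sum of completion times under greedy in-order admission.

    order_key=None keeps arrival order (FCFS); a key such as
    lambda r: r.decode gives a Shortest-First order.
    """
    pending = sorted(requests, key=order_key) if order_key else list(requests)
    running = []  # (request, tokens_done) pairs
    clock = 0
    total = 0
    while pending or running:
        # Admit from the head of the queue while the projected peak fits.
        while pending:
            if peak_memory(running + [(pending[0], 0)]) > memory_limit:
                break
            running.append((pending.pop(0), 0))
        if not running:
            raise ValueError("a request exceeds the memory limit on its own")
        clock += 1  # every running request decodes one token
        still_running = []
        for r, done in running:
            done += 1
            if done >= r.decode:
                total += clock  # request finishes at this step
            else:
                still_running.append((r, done))
        running = still_running
    return total


requests = [Request(4, 6), Request(2, 2), Request(2, 2)]
fcfs = total_completion_time(requests, memory_limit=10)
sf = total_completion_time(requests, memory_limit=10,
                           order_key=lambda r: r.decode)
print(fcfs, sf)  # prints "16 12"
```

In this toy instance the long request admitted first under FCFS blocks one short request until time 6, whereas Shortest-First finishes both short requests by time 2. This is only an illustration of why admission order matters under a growing memory footprint; it says nothing about the competitive ratios proved in the paper.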


Related papers