Fugu-MT Paper Translation (Abstract): Adaptively Robust LLM Inference Optimization under Prediction Uncertainty

Paper overview: Adaptively Robust LLM Inference Optimization under Prediction Uncertainty

arxiv url: http://arxiv.org/abs/2508.14544v1
Date: Wed, 20 Aug 2025 08:55:26 GMT
Status: translation complete
Last updated in system: 2025-08-21 16:52:41.401808
Title: Adaptively Robust LLM Inference Optimization under Prediction Uncertainty
Title (reference translation): Adaptive Robust LLM Inference Optimization under Prediction Uncertainty
Authors: Zixi Chen, Yinyu Ye, Zijie Zhou
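Abstract summary: This paper studies the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. A key challenge in LLM inference is that while the prompt length is known upon arrival, the output length, which critically affects memory usage and processing time, is unknown. The paper proposes algorithms that use machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request.
Reference score (independently computed attention score): 3.4858872019721447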
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online, multi-task, and energy-intensive service process in which a pre-trained LLM processes input requests and generates output tokens sequentially. It is therefore vital to improve scheduling efficiency and reduce power consumption when a large number of prompt requests arrive. A key challenge in LLM inference scheduling is that while the prompt length is known upon arrival, the output length, which critically impacts memory usage and processing time, is unknown. To address this uncertainty, we propose algorithms that leverage machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. We first design a conservative algorithm, $\mathcal{A}_{\max}$, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to potential overestimation. To overcome this limitation, we propose $\mathcal{A}_{\min}$, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inference. We prove that $\mathcal{A}_{\min}$ achieves a log-scale competitive ratio. Through numerical simulations, we demonstrate that $\mathcal{A}_{\min}$ often performs nearly as well as the hindsight scheduler, highlighting both its efficiency and robustness in practical scenarios. Moreover, $\mathcal{A}_{\min}$ relies solely on the lower bound of the prediction interval, an advantageous design choice since upper bounds on output length are typically more challenging to predict accurately.
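Abstract (reference translation): This paper studies the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online, multi-task service process in which a pre-trained LLM processes input requests and generates output tokens sequentially, and it also consumes a great deal of energy. It is therefore essential to improve scheduling efficiency and reduce power consumption while large numbers of requests are arriving. A key challenge in LLM inference is that while the prompt length is known upon arrival, the output length, which critically affects memory usage and processing time, is unknown. To address this uncertainty, the paper proposes algorithms that use machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. It first designs $\mathcal{A}_{\max}$, a conservative algorithm that schedules requests based on the upper bound of the predicted output length to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly because of potential overestimation. To overcome this limitation, the paper proposes $\mathcal{A}_{\min}$, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inference. The authors prove that $\mathcal{A}_{\min}$ achieves a log-scale competitive ratio. Numerical simulations show that $\mathcal{A}_{\min}$ often performs nearly as well as the hindsight scheduler, highlighting its efficiency and robustness in practical scenarios. Moreover, $\mathcal{A}_{\min}$ relies solely on the lower bound of the prediction interval.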
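Illustrative sketch (not the authors' implementation): the contrast between scheduling on the predicted upper bound and scheduling on the predicted lower bound can be shown with a toy admission rule. The abstract does not specify the batching policy, the memory model, or how the estimate is refined, so everything below is an assumption made for illustration: the `Request` fields, the greedy `admit` function, the "memory = prompt length + estimated output length" model, and the `refine_lower_bound` rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int  # known on arrival
    lo: int          # predicted lower bound on output length
    hi: int          # predicted upper bound on output length


def admit(requests, memory_budget, use_upper_bound):
    """Greedily admit requests while their reserved memory fits the budget.

    Assumption: each request reserves prompt_len plus an estimated output
    length. use_upper_bound=True mimics a conservative, A_max-style rule;
    use_upper_bound=False mimics the optimistic starting point of A_min.
    """
    def footprint(r):
        return r.prompt_len + (r.hi if use_upper_bound else r.lo)

    admitted, reserved = [], 0
    for r in sorted(requests, key=footprint):  # smallest footprint first
        need = footprint(r)
        if reserved + need <= memory_budget:
            admitted.append(r)
            reserved += need
    return admitted


def refine_lower_bound(current_estimate, tokens_generated):
    """Hypothetical refinement: once generation passes the estimate,
    raise the estimate to the number of tokens observed so far."""
    return max(current_estimate, tokens_generated)


# Three identical requests with a wide prediction interval [5, 50].
reqs = [Request(prompt_len=10, lo=5, hi=50) for _ in range(3)]
conservative = admit(reqs, memory_budget=100, use_upper_bound=True)
optimistic = admit(reqs, memory_budget=100, use_upper_bound=False)
print(len(conservative), len(optimistic))  # prints "1 3"
```

With a wide interval, the upper-bound rule reserves 60 units per request and admits only one of the three, while the lower-bound rule reserves 15 units each and admits all three. A real scheduler following the lower-bound approach would also need a policy for when actual output exceeds the reserved memory, which this sketch leaves out.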


Related papers