Fugu-MT Paper Translation (Abstract): Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

Paper summary: Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

arxiv url: http://arxiv.org/abs/2606.22327v1
Date: Sun, 21 Jun 2026 04:05:38 GMT
Status: Translation complete
Last updated in system: 2026-06-25 17:05:30.008442
Title: Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice
Title (reference translation): Geometric Online Scheduling for LLM Serving: From Theoretical Bounds to System Practice
Authors: Li Kong, Qi Qi, Yinyu Ye, Zijie Zhou
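Abstract summary: This paper proposes geometry-aware online scheduling based on the Smallest Volume First (SVF) algorithm and its highly efficient variant, 1-bit SVF. SVF delivers strong reductions in both average and tail latency, while 1-bit SVF achieves competitive throughput and latency using only a single bit of information. The work establishes a theoretically sound and empirically validated approach to memory-aware scheduling in modern deployments.
Reference score (attention score computed independently by the site): 12.809268527016599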
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The explosive demand for interactive Large Language Model serving has highlighted the management of the Key-Value cache's dynamic memory footprint as a critical area for performance optimization in inference engines. Modern inference systems overwhelmingly rely on time-centric scheduling heuristics, such as Shortest Job First. However, their theoretical optimality is rooted in traditional schedule modeling, failing to capture the highly dynamic, 2D spatio-temporal geometric growth specific to LLM inference mechanisms. To resolve this, we propose the geometry-aware online scheduling by introducing the Smallest Volume First (SVF) algorithm and its highly efficient variant, 1-bit SVF. Theoretically, we provide a rigorous mathematical foundation for our approach. Utilizing a novel proof methodology, we tighten the worst-case competitive ratio ($\text{CR} \le 48 \rightarrow \text{CR} \le 5$) for SVF with known output lengths. Building upon this core breakthrough, we complete a comprehensive theoretical taxonomy analyzing our algorithms across different traffic scenarios and information availability. Practically, we seamlessly integrate our approach as a plug-and-play layer in vLLM. Extensive evaluations on Llama-3.1 models demonstrate comprehensive performance gains: SVF delivers strong reductions in both average and tail latency, while 1-bit SVF, with merely a single bit information, achieves competitive throughput and latency. This work establishes a theoretically sound and empirically proven approach for resolving memory-constrained scheduling in modern LLM deployments. To facilitate future research, our code is available at https://github.com/Aurora-Kl/Geometry-Aware-Online-Scheduling.git.
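Abstract (reference translation): The explosive demand for interactive Large Language Model serving has highlighted management of the Key-Value cache's dynamic memory footprint as a critical area for performance optimization in inference engines. Modern inference systems rely on time-centric scheduling heuristics such as Shortest Job First. However, the theoretical optimality of these heuristics is rooted in traditional scheduling models, which fail to capture the highly dynamic, 2D spatio-temporal growth specific to LLM inference. This work therefore proposes geometry-aware online scheduling by introducing the Smallest Volume First (SVF) algorithm and its highly efficient variant, 1-bit SVF. On the theory side, the authors provide a rigorous mathematical foundation for the approach. Using a new proof technique, they tighten the worst-case competitive ratio ($\text{CR} \le 48 \rightarrow \text{CR} \le 5$) for SVF with known output lengths. Building on this core result, they complete a comprehensive theoretical taxonomy that analyzes the algorithms across different traffic scenarios and levels of information availability. On the practical side, they integrate the approach into vLLM as a plug-and-play layer. Extensive evaluations on Llama-3.1 models show gains across the board: SVF delivers strong reductions in both average and tail latency, while 1-bit SVF, with only a single bit of information, achieves competitive throughput and latency. The work establishes a theoretically sound and empirically validated approach to memory-constrained scheduling in modern LLM deployments. To facilitate future research, the code is available at https://github.com/Aurora-Kl/Geometry-Aware-Online-Scheduling.git.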
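The contrast the abstract draws between time-centric ordering (Shortest Job First) and volume-based ordering can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the `Request` type, the helper names, and in particular the definition of "volume" as the KV-cache footprint (prompt tokens plus tokens generated so far) summed over decode steps. The 1-bit variant is not shown because the abstract does not say what the single bit encodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative only."""
    name: str
    prompt_len: int  # tokens held in the KV cache when decoding starts
    output_len: int  # tokens to generate (assumed known in advance)


def kv_volume(req: Request) -> int:
    """Assumed 'volume': memory (in tokens) integrated over decode steps.

    At decode step t (1..output_len) the cache holds prompt_len + t tokens,
    so the sum is prompt_len * o + o * (o + 1) / 2 with o = output_len.
    """
    o = req.output_len
    return req.prompt_len * o + o * (o + 1) // 2


def order_sjf(reqs: list[Request]) -> list[Request]:
    """Time-centric baseline: shortest output length first."""
    return sorted(reqs, key=lambda r: r.output_len)


def order_svf(reqs: list[Request]) -> list[Request]:
    """Volume-based ordering under the assumed definition above."""
    return sorted(reqs, key=kv_volume)


reqs = [
    Request("long_prompt_short_answer", prompt_len=4000, output_len=50),
    Request("short_prompt_long_answer", prompt_len=100, output_len=200),
]

for r in reqs:
    print(r.name, kv_volume(r))
print("SJF order:", [r.name for r in order_sjf(reqs)])
print("SVF order:", [r.name for r in order_svf(reqs)])
```

Under this assumed definition the two orderings disagree: the request with the long prompt finishes in fewer steps, so SJF runs it first, but it occupies far more cache over its lifetime, so the volume-based ordering runs it second.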

Related papers