Fugu-MT Paper Translation (Abstract): Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Paper abstract: Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

arxiv url: http://arxiv.org/abs/2605.21312v1
Date: Wed, 20 May 2026 15:40:18 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.760034
Title: Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
Authors: Yicheng Feng, Xin Tan, Yangtao Deng, Yimin Jiang, Yibo Zhu, Hong Xu
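Abstract summary: This paper presents Frontier, a discrete-event simulator for modern LLM inference serving. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD). On a 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation.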
Reference score (independently computed interest level): 15.58999342618182
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On a 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration.
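The abstract describes a discrete-event simulator that models Prefill-Decode Disaggregation with role-specific workers. As a rough illustration of that idea only, and not Frontier's actual design or API, the sketch below drives one prefill worker and one decode worker from a single time-ordered event queue. The function name `simulate_pdd`, its parameters, the fixed per-stage service times, and the one-request-at-a-time FIFO policy (no batching) are all assumptions made for this example.

```python
import heapq
from collections import deque


def simulate_pdd(arrivals, prefill_s, decode_s, kv_transfer_s=0.0):
    """Toy discrete-event model of prefill-decode disaggregation.

    Hypothetical simplification: one prefill worker and one decode worker,
    each serving a single request at a time in FIFO order with a fixed
    service time. Returns the end-to-end latency of every request.
    """
    events = []  # min-heap of (time, seq, kind, request_id)
    seq = 0      # unique tie-breaker so equal-time events pop in push order
    for rid, t in enumerate(arrivals):
        heapq.heappush(events, (t, seq, "arrive", rid))
        seq += 1

    queues = {"prefill": deque(), "decode": deque()}
    busy = {"prefill": False, "decode": False}
    service = {"prefill": prefill_s, "decode": decode_s}
    finish = {}

    def try_start(stage, now):
        """Start the next queued request on `stage` if the worker is idle."""
        nonlocal seq
        if not busy[stage] and queues[stage]:
            rid = queues[stage].popleft()
            busy[stage] = True
            heapq.heappush(events, (now + service[stage], seq, stage + "_done", rid))
            seq += 1

    while events:
        now, _, kind, rid = heapq.heappop(events)
        if kind == "arrive":
            queues["prefill"].append(rid)
            try_start("prefill", now)
        elif kind == "prefill_done":
            busy["prefill"] = False
            # The KV cache must reach the decode worker before decoding starts.
            heapq.heappush(events, (now + kv_transfer_s, seq, "kv_ready", rid))
            seq += 1
            try_start("prefill", now)
        elif kind == "kv_ready":
            queues["decode"].append(rid)
            try_start("decode", now)
        elif kind == "decode_done":
            busy["decode"] = False
            finish[rid] = now
            try_start("decode", now)

    return [finish[rid] - t for rid, t in enumerate(arrivals)]


if __name__ == "__main__":
    # Three requests arriving together; times are in arbitrary units.
    print(simulate_pdd([0.0, 0.0, 0.0], prefill_s=1.0, decode_s=2.0, kv_transfer_s=0.5))
```

Even in this toy form, the queueing delay behind the slower decode stage is visible per request, which is the kind of effect an average-case analytical proxy tends to flatten.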

Related papers