Fugu-MT Paper Translation (Abstract): AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

Paper overview: AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

arxiv url: http://arxiv.org/abs/2601.06288v1
Date: Fri, 09 Jan 2026 20:03:57 GMT
Status: Translation complete
System update date: 2026-01-13 19:08:00.733062
Title: AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving
Title (reference translation): AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving
Authors: Tianhao Xu, Yiming Liu, Xianglong Lu, Yijia Zhao, Xuting Zhou, Aichen Feng, Yiyi Chen, Yi Shen, Qin Zhou, Xumeng Chen, Ilya Sherstyuk, Haorui Li, Rishi Thakkar, Ben Hamm, Yuanzhe Li, Xue Huang, Wenpeng Wu, Anish Shanbhag, Harry Kim, Chuan Chen, Junjie Lai
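Abstract summary: AIConfigurator is a unified performance-modeling system for Large Language Model (LLM) inference. It enables rapid, framework-agnostic configuration search without requiring GPU-based profiling, and it identifies superior serving configurations that improve performance by up to 40% for dense models.
Reference score (independently computed attention metric): 16.664502126572856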
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Optimizing Large Language Model (LLM) inference in production systems is increasingly difficult due to dynamic workloads, stringent latency/throughput targets, and a rapidly expanding configuration space. This complexity spans not only distributed parallelism strategies (tensor/pipeline/expert) but also intricate framework-specific runtime parameters, such as whether CUDA graphs are enabled, the available KV-cache memory fraction, and the maximum token capacity, all of which drastically impact performance. The diversity of modern inference frameworks (e.g., TRT-LLM, vLLM, SGLang), each employing distinct kernels and execution policies, makes manual tuning both framework-specific and computationally prohibitive. We present AIConfigurator, a unified performance-modeling system that enables rapid, framework-agnostic inference configuration search without requiring GPU-based profiling. AIConfigurator combines (1) a methodology that decomposes inference into analytically modelable primitives (GEMM, attention, communication, and memory operations) while capturing framework-specific scheduling dynamics; (2) a calibrated kernel-level performance database for these primitives across a wide range of hardware platforms and popular open-weights models (GPT-OSS, Qwen, DeepSeek, Llama, Mistral); and (3) an abstraction layer that automatically resolves optimal launch parameters for the target backend, seamlessly integrating into production-grade orchestration systems. Evaluation on production LLM serving workloads demonstrates that AIConfigurator identifies superior serving configurations that improve performance by up to 40% for dense models (e.g., Qwen3-32B) and 50% for MoE architectures (e.g., DeepSeek-V3), while completing searches within 30 seconds on average, enabling the rapid exploration of vast design spaces, from cluster topology down to engine-specific flags.
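Abstract (reference translation): Optimizing Large Language Model (LLM) inference in production systems is becoming increasingly difficult because of dynamic workloads, strict latency/throughput targets, and a rapidly expanding configuration space. This complexity extends beyond distributed parallelism strategies (tensor/pipeline/expert) to framework-specific runtime parameters that strongly affect performance, such as enabling CUDA graphs, the available KV-cache memory fraction, and the maximum token capacity. Because modern inference frameworks (TRT-LLM, vLLM, SGLang, and others) each use different kernels and execution policies, manual tuning is both framework-specific and computationally prohibitive. We propose AIConfigurator, a unified performance-modeling system that enables fast, framework-agnostic inference configuration search without requiring GPU-based profiling. AIConfigurator combines (1) a methodology that decomposes inference into analytically modelable primitives (GEMM, attention, communication, and memory operations) while capturing framework-specific scheduling dynamics; (2) a calibrated kernel-level performance database for these primitives across a wide range of hardware platforms and popular open-weights models (GPT-OSS, Qwen, DeepSeek, Llama, Mistral); and (3) an abstraction layer that automatically resolves the optimal launch parameters for the target backend and integrates seamlessly into production-grade orchestration systems. Evaluation on production LLM serving workloads shows that AIConfigurator identifies superior serving configurations that improve performance by up to 40% for dense models (e.g., Qwen3-32B) and up to 50% for MoE architectures (e.g., DeepSeek-V3), while completing searches within 30 seconds on average. This enables rapid exploration of vast design spaces, from cluster topology down to engine-specific flags.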
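
The idea of decomposing inference into analytically modelable primitives and then searching configurations without touching a GPU can be illustrated with a toy sketch. This is not AIConfigurator's implementation: the roofline-style cost functions, the `Hardware` fields, the toy dense-model layout, and every name below are assumptions made for illustration, and the sketch ignores attention over the KV cache, framework scheduling effects, and any calibrated kernel database.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hardware:
    """Hypothetical per-GPU capabilities (illustrative fields, not measured values)."""
    peak_flops: float  # FLOP/s
    mem_bw: float      # bytes/s of device memory bandwidth
    link_bw: float     # bytes/s of interconnect bandwidth


def gemm_time(m, n, k, hw, bytes_per_elem=2):
    """Roofline estimate: the slower of the compute-bound and memory-bound times."""
    flops = 2.0 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return max(flops / hw.peak_flops, traffic / hw.mem_bw)


def allreduce_time(num_bytes, tp, hw):
    """Ring all-reduce: each GPU transfers 2*(tp-1)/tp of the payload."""
    if tp == 1:
        return 0.0
    return 2.0 * (tp - 1) / tp * num_bytes / hw.link_bw


def decode_step_time(batch, tp, hidden, layers, hw):
    """One decode step of a toy dense model with weights sharded over tp GPUs."""
    per_layer = (
        gemm_time(batch, 4 * hidden // tp, hidden, hw)    # attention projections (approximated)
        + gemm_time(batch, 4 * hidden // tp, hidden, hw)  # MLP up-projection
        + gemm_time(batch, hidden, 4 * hidden // tp, hw)  # MLP down-projection
        + 2 * allreduce_time(2 * batch * hidden, tp, hw)  # two fp16 all-reduces per layer
    )
    return layers * per_layer


def search(hw, hidden, layers, tp_options, batch_options, max_step_latency):
    """Return (tokens/s per GPU, tp, batch, step time) for the best feasible config, or None."""
    best = None
    for tp, batch in product(tp_options, batch_options):
        step = decode_step_time(batch, tp, hidden, layers, hw)
        if step > max_step_latency:
            continue  # violates the latency target
        per_gpu_throughput = batch / step / tp
        if best is None or per_gpu_throughput > best[0]:
            best = (per_gpu_throughput, tp, batch, step)
    return best


if __name__ == "__main__":
    hw = Hardware(peak_flops=1e15, mem_bw=3e12, link_bw=4e11)  # made-up numbers
    print(search(hw, hidden=5120, layers=64,
                 tp_options=[1, 2, 4, 8], batch_options=[1, 8, 32, 128],
                 max_step_latency=0.02))
```

Because every candidate is scored by closed-form arithmetic rather than by running kernels, the whole grid is evaluated in well under a second; a real system would replace the roofline formulas with measured per-kernel data and model far more parameters.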

Related papers