Fugu-MT paper translation (abstract): From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models

Paper abstract: From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models

arxiv url: http://arxiv.org/abs/2511.02248v1
Date: Tue, 04 Nov 2025 04:26:47 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:05.810195
Title: From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
Authors: Xingqi Cui, Chieh-Jan Mike Liang, Jiarong Xing, Haoran Qiu,
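Abstract summary: Existing solutions rely on static provisioning or model-level autoscaling. This coarse-grained resource management leads to degraded performance or significant resource underutilization. The paper proposes an operator-level autoscaling framework that allocates resources at a finer granularity.
Reference score (independently computed attention score): 4.720658518775265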
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Serving large generative models such as LLMs and multimodal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) with provider goals of efficiency and cost reduction. Existing solutions rely on static provisioning or model-level autoscaling, both of which treat the model as a monolith. This coarse-grained resource management leads to degraded performance or significant resource underutilization due to poor adaptability to dynamic inference traffic that is common online. The root cause of this inefficiency lies in the internal structure of generative models: they are executed as graphs of interconnected operators. Through detailed characterization and systematic analysis, we find that operators are heterogeneous in their compute and memory footprints and exhibit diverse sensitivity to workload and resource factors such as batch size, sequence length, and traffic rate. This heterogeneity suggests that the operator, rather than the entire model, is the right granularity for scaling decisions. We propose an operator-level autoscaling framework, which allocates resources at finer (operator)-granularity, optimizing the scaling, batching, and placement based on individual operator profiles. Evaluated on production-scale traces, our approach preserves SLOs with up to 40% fewer GPUs and 35% less energy, or under fixed resources achieves 1.6x higher throughput with 5% less energy. These results show that the operator, rather than the model, is fundamentally a more effective unit for scaling large generative workloads.
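Illustration: the contrast the abstract draws between model-level and operator-level scaling can be sketched in a few lines of Python. This is not the paper's framework. The operator names, per-replica capacities, GPU shares, and the simple ceiling-based sizing rule are all hypothetical assumptions chosen for illustration, and batching and placement are ignored.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorProfile:
    name: str
    capacity_rps: float  # hypothetical: requests/s one replica sustains within the SLO
    gpu_share: float     # hypothetical: fraction of a GPU one replica occupies


# Made-up profiles; real values would come from offline profiling.
PROFILES = [
    OperatorProfile("attention", capacity_rps=20.0, gpu_share=0.5),
    OperatorProfile("mlp", capacity_rps=50.0, gpu_share=0.25),
    OperatorProfile("embedding", capacity_rps=100.0, gpu_share=0.25),
]


def operator_level_plan(profiles, traffic_rps):
    """Size each operator independently against the incoming traffic."""
    return {p.name: math.ceil(traffic_rps / p.capacity_rps) for p in profiles}


def operator_level_gpus(profiles, traffic_rps):
    """Total GPU share when every operator gets only the replicas it needs."""
    plan = operator_level_plan(profiles, traffic_rps)
    return sum(plan[p.name] * p.gpu_share for p in profiles)


def model_level_gpus(profiles, traffic_rps):
    """Total GPU share when the whole model is replicated as a monolith.

    The slowest operator dictates the replica count, and every replica
    carries a full copy of all operators.
    """
    bottleneck = min(p.capacity_rps for p in profiles)
    replicas = math.ceil(traffic_rps / bottleneck)
    return replicas * sum(p.gpu_share for p in profiles)


if __name__ == "__main__":
    traffic = 100.0  # hypothetical requests per second
    print("operator-level plan:", operator_level_plan(PROFILES, traffic))
    print("operator-level GPUs:", operator_level_gpus(PROFILES, traffic))
    print("model-level GPUs:   ", model_level_gpus(PROFILES, traffic))
```

With these made-up numbers the operator-level plan uses 3.25 GPUs against 5.0 for whole-model replication, because only the bottleneck operator is replicated five times. The figures say nothing about the paper's measured savings.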

Related papers