Fugu-MT Paper Translation (Abstract): Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving

Paper overview: Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving

arxiv url: http://arxiv.org/abs/2604.20420v1
Date: Wed, 22 Apr 2026 10:39:14 GMT
Status: Translation complete
Last updated in system: 2026-04-23 15:36:11.087067
Title: Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving
Authors: Hung Cuong Pham, Fatih Gedikli
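Abstract summary: This study investigates the performance and optimization of a BentoML-based AI inference system for scalable model serving, developed in collaboration with graphworks.ai. It examines how latency and throughput scale under varying workloads, how optimizations at the runtime, service, and deployment levels affect response time, and how deployment in a single-node K3s cluster influences resilience during disruptions.
Reference score (independently computed attention metric): 0.9167082845109437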
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI research often emphasizes model design and algorithmic performance, while deployment and inference remain comparatively underexplored despite being critical for real-world use. This study addresses that gap by investigating the performance and optimization of a BentoML-based AI inference system for scalable model serving developed in collaboration with graphworks.ai. The evaluation first establishes baseline performance under three realistic workload scenarios. To ensure a fair and reproducible assessment, a pre-trained RoBERTa sentiment analysis model is used throughout the experiments. The system is subjected to traffic patterns following gamma and exponential distributions in order to emulate real-world usage conditions, including steady, bursty, and high-intensity workloads. Key performance metrics, such as latency percentiles and throughput, are collected and analyzed to identify bottlenecks in the inference pipeline. Based on the baseline results, optimization strategies are introduced at multiple levels of the serving stack to improve efficiency and scalability. The optimized system is then reevaluated under the same workload conditions, and the results are compared with the baseline using statistical analysis to quantify the impact of the applied improvements. The findings demonstrate practical strategies for achieving efficient and scalable AI inference with BentoML. The study examines how latency and throughput scale under varying workloads, how optimizations at the runtime, service, and deployment levels affect response time, and how deployment in a single-node K3s cluster influences resilience during disruptions.
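The workload and metric setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' load generator: the function names, the `rate`, `shape`, and `seed` parameters, and the choice of p50/p95/p99 are assumptions made here for illustration, and the abstract does not state the actual distribution parameters. The sketch draws inter-arrival gaps from an exponential distribution (steady traffic) or a gamma distribution with shape below 1 (burstier traffic), then summarizes latency percentiles and throughput.

```python
import random
import statistics


def interarrival_times(n, rate, shape=1.0, seed=0):
    """Draw n inter-arrival gaps in seconds with mean 1 / rate.

    shape == 1.0 yields exponential gaps (Poisson arrivals, steady traffic);
    shape < 1.0 yields gamma gaps with higher variance (burstier traffic).
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    scale = 1.0 / (rate * shape)  # gamma mean = shape * scale = 1 / rate
    return [rng.gammavariate(shape, scale) for _ in range(n)]


def latency_summary(latencies_ms):
    """Return the p50, p95, and p99 latency of a list of samples."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def throughput(num_requests, gaps):
    """Requests per second over the whole arrival window."""
    return num_requests / sum(gaps)
```

A load test would sleep for each gap before sending the next request to the serving endpoint, record the per-request latency, and pass the collected samples to `latency_summary`; running the same seeded schedule against the baseline and the optimized system keeps the comparison reproducible.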

Related papers