Fugu-MT Paper Translation (Abstract): MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

Paper summary: MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

arxiv url: http://arxiv.org/abs/2510.03283v1
Date: Sun, 28 Sep 2025 18:45:28 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:58.645634
Title: MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment
Authors: Yufei Li, Yu Fu, Yue Dong, Cong Liu
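Abstract summary: Large language models (LLMs) deployed on edge servers are increasingly used in latency-sensitive applications such as personalized assistants, recommendation, and content moderation. Existing retraining strategies either delay model updates, over-commit resources to retraining, or overlook iteration-level retraining granularity. We propose MACE, a hybrid LLM system that colocates concurrent inference (prefill, decode) and fine-tuning, with intelligent memory management to maximize task performance while promising inference throughput.
Reference score (independently computed attention score): 14.392166280035122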
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) deployed on edge servers are increasingly used in latency-sensitive applications such as personalized assistants, recommendation, and content moderation. However, the non-stationary nature of user data necessitates frequent retraining, which introduces a fundamental tension between inference latency and model accuracy under constrained GPU resources. Existing retraining strategies either delay model updates, over-commit resources to retraining, or overlook iteration-level retraining granularity. In this paper, we identify that iteration-level scheduling is crucial for adapting retraining frequency to model drift without violating service-level objectives (SLOs). We propose MACE, a hybrid LLM system that colocates concurrent inference (prefill, decode) and fine-tuning, with intelligent memory management to maximize task performance while promising inference throughput. MACE leverages the insight that not all model updates equally affect output alignment and allocates GPU cycles accordingly to balance throughput, latency, and update freshness. Our trace-driven evaluation shows that MACE matches or exceeds continuous retraining while reducing inference latency by up to 63% and maintaining throughput under resource constraints. Compared to periodic retraining, MACE improves latency breakdown across prefill, decode, and finetune stages, and sustains GPU utilization above 85% in NVIDIA AGX Orin. These results demonstrate that iteration-level hybrid scheduling is a promising direction for deploying LLMs with continual learning capabilities on edge platforms.
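The abstract describes iteration-level scheduling that colocates inference and fine-tuning without violating SLOs, but gives no algorithmic detail. The following is only a minimal sketch of what a per-iteration decision of that kind might look like, not MACE's actual scheduler. Every name in it (`IterationState`, `schedule_iteration`, `drift_score`, `drift_threshold`, the latency estimates) is a hypothetical introduced for illustration, and the policy of adding the two latency estimates is an assumed, deliberately pessimistic simplification.

```python
from dataclasses import dataclass


@dataclass
class IterationState:
    """Hypothetical per-iteration signals; none of these names come from the paper."""
    pending_requests: int    # inference requests waiting in the queue
    est_infer_ms: float      # estimated latency of the next inference iteration
    est_finetune_ms: float   # estimated cost of one fine-tuning iteration
    drift_score: float       # 0..1, how stale the model looks on recent data


def schedule_iteration(state: IterationState, slo_ms: float,
                       drift_threshold: float = 0.3) -> str:
    """Choose what to run in the next iteration.

    Returns one of "idle", "inference", "finetune", or "both" (colocated).
    Toy policy (an assumption, not the paper's method): fine-tune only when
    drift is high enough to matter, and colocate only when the combined
    latency estimate still fits within the SLO.
    """
    wants_finetune = state.drift_score >= drift_threshold

    if state.pending_requests == 0:
        # No inference work: spend the iteration on retraining if drifted.
        return "finetune" if wants_finetune else "idle"

    if not wants_finetune:
        # Model is fresh enough: give the whole iteration to inference.
        return "inference"

    # Requests are waiting and the model has drifted. Colocate only if the
    # serialized (worst-case) estimate stays within the latency SLO.
    if state.est_infer_ms + state.est_finetune_ms <= slo_ms:
        return "both"
    return "inference"


if __name__ == "__main__":
    busy = IterationState(pending_requests=4, est_infer_ms=40.0,
                          est_finetune_ms=50.0, drift_score=0.5)
    print(schedule_iteration(busy, slo_ms=100.0))
```

Making the decision once per iteration, rather than once per retraining job, is what allows such a policy to back off from fine-tuning as soon as the latency estimate approaches the SLO. A real system would also need to account for GPU memory pressure, which this sketch ignores.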


Related papers