Fugu-MT Paper Translation (Abstract): ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

Paper overview: ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

arxiv url: http://arxiv.org/abs/2604.00136v1
Date: Tue, 31 Mar 2026 18:41:53 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.682808
Title: ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Title (reference translation): ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Authors: Annette Taberner-Miller
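Abstract summary: LLM serving often relies on multi-model portfolios spanning a roughly 530x cost range. Providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. This paper proposes an adaptive router that accounts for cost-effectiveness.
Reference score (independently computed attention score): 0.0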
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models rejected after bounded exploration. End-to-end routing latency is 9.8ms on CPU -- less than 0.4% of typical inference time -- with the routing decision itself taking just 22.5us.
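Abstract (reference translation): Production LLM serving often relies on multi-model portfolios spanning a roughly 530x cost range, where routing decisions trade off quality against cost. Providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. ParetoBandit is an open-source adaptive router built on cost-aware contextual bandits that simultaneously enforces dollar-denominated budgets, adapts online to such shifts, and onboards new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer. ParetoBandit is evaluated across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within about 142 steps without breaching the cost ceiling. Expensive models are budget-gated, and low-quality models are rejected after bounded exploration. End-to-end routing latency is 9.8 ms on CPU, less than 0.4% of typical inference time.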
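
The three mechanisms named in the abstract (a primal-dual budget pacer, geometric forgetting on sufficient statistics, and a hot-swap registry with forced exploration) can be illustrated with a deliberately simplified sketch. This is not the ParetoBandit implementation: it drops the context features and uses a plain discounted UCB rule, forces only a single exploratory pull per new model, and every name and constant here (`BudgetPacedRouter`, `simulate`, `gamma`, `eta`, `alpha`, the model names, costs, and quality levels) is a hypothetical choice made for illustration.

```python
import math
import random


class BudgetPacedRouter:
    """Toy non-contextual router: discounted UCB plus a primal-dual budget pacer."""

    def __init__(self, costs, budget, gamma=0.99, eta=0.05, alpha=0.5):
        self.costs = dict(costs)  # model name -> dollars per request (assumed known)
        self.budget = budget      # target mean cost per request, in dollars
        self.gamma = gamma        # geometric forgetting factor (hypothetical value)
        self.eta = eta            # dual step size (hypothetical value)
        self.alpha = alpha        # exploration weight (hypothetical value)
        self.lam = 0.0            # dual variable: penalty per unit of cost/budget
        self.n = {m: 0.0 for m in self.costs}  # discounted pull counts
        self.s = {m: 0.0 for m in self.costs}  # discounted quality sums

    def add_model(self, name, cost):
        """Hot-swap: register a model at runtime with empty statistics."""
        self.costs[name] = cost
        self.n[name] = 0.0
        self.s[name] = 0.0

    def remove_model(self, name):
        """Hot-swap: drop a model and its statistics."""
        for table in (self.costs, self.n, self.s):
            table.pop(name, None)

    def select(self):
        total = sum(self.n.values())

        def score(m):
            if self.n[m] == 0.0:
                return math.inf  # forced exploration: unseen models go first
            mean = self.s[m] / self.n[m]
            bonus = self.alpha * math.sqrt(math.log(1.0 + total) / self.n[m])
            # Lagrangian score: optimistic quality minus the priced relative cost.
            return mean + bonus - self.lam * self.costs[m] / self.budget

        return max(self.costs, key=score)

    def update(self, model, quality):
        # Geometric forgetting: decay every arm's sufficient statistics.
        for m in self.n:
            self.n[m] *= self.gamma
            self.s[m] *= self.gamma
        self.n[model] += 1.0
        self.s[model] += quality
        # Dual ascent, floored at zero: lam rises when this request cost more
        # than the budget and falls when it cost less.
        overspend = self.costs[model] / self.budget - 1.0
        self.lam = max(0.0, self.lam + self.eta * overspend)


def simulate(steps=5000, seed=0):
    """Route synthetic traffic and return the router and the mean cost per request."""
    rng = random.Random(seed)
    costs = {"small": 0.001, "medium": 0.01, "large": 0.1}          # made-up prices
    true_quality = {"small": 0.60, "medium": 0.75, "large": 0.90}   # made-up quality
    router = BudgetPacedRouter(costs, budget=0.02)
    spent = 0.0
    for _ in range(steps):
        model = router.select()
        quality = min(1.0, max(0.0, rng.gauss(true_quality[model], 0.1)))
        router.update(model, quality)
        spent += costs[model]
    return router, spent / steps


if __name__ == "__main__":
    router, mean_cost = simulate()
    print(f"mean cost per request: {mean_cost:.4f} (budget {router.budget})")
```

Because the dual variable is floored at zero, each update satisfies `lam_next >= lam + eta * (cost / budget - 1)`. Summing over T requests bounds the mean cost by `budget * (1 + lam_T / (eta * T))`, so in this toy the overshoot shrinks as the stream grows, provided `lam` stays bounded. That is the closed-loop behaviour the abstract contrasts with offline penalty tuning.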

Related papers