Fugu-MT Paper Translation (Abstract): The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Paper overview: The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

arxiv url: http://arxiv.org/abs/2603.21354v1
Date: Sun, 22 Mar 2026 18:30:11 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.373948
Title: The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
Title (reference translation): The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang
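Abstract summary: The vLLM Semantic Router project has released a series of work covering signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection. This paper describes the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization.
Reference score (independently computed attention measure): 30.96691028676722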
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.
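Abstract (reference translation): Over the past year, the vLLM Semantic Router project has released a series of work covering: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper addressed a specific problem in LLM inference, but the problems are not independent. For example, fleet provisioning depends on the routing policy, which in turn depends on the workload mix, and that mix shifts as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). The authors map their prior work onto a 3x3 WRP interaction matrix, identify which cells are covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in prior measurements and tiered by maturity from engineering-ready to open research.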
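The three WRP dimensions named in the abstract can be written down as plain data types. The following is only an illustrative sketch, not code from the paper or from the vLLM Semantic Router project: every class, field, and variable name is hypothetical, and the reading of the "3x3 interaction matrix" as all ordered pairs of the three dimensions is an assumption, since the abstract does not define the matrix axes.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Dimension(Enum):
    """The three WRP axes, with the question each one answers (per the abstract)."""
    WORKLOAD = "what the fleet serves"
    ROUTER = "how each request is dispatched"
    POOL = "where inference runs"


class RouterStrategy(Enum):
    """Dispatch strategies listed in the abstract."""
    STATIC_SEMANTIC_RULES = "static semantic rules"
    ONLINE_BANDIT = "online bandit adaptation"
    RL_MODEL_SELECTION = "RL-based model selection"
    QUALITY_AWARE_CASCADE = "quality-aware cascading"


@dataclass(frozen=True)
class Workload:
    """Workload traits from the abstract; boolean encoding is hypothetical."""
    agentic: bool        # False = chat, True = agent
    multi_turn: bool     # False = single-turn
    warm: bool           # False = cold
    prefill_heavy: bool  # False = decode-heavy


@dataclass(frozen=True)
class Pool:
    """Pool traits from the abstract; field names are hypothetical."""
    heterogeneous_gpu: bool             # False = homogeneous GPU
    disaggregated_prefill_decode: bool
    kv_cache_topology: str              # free-form label, not a real config key


# Assumption: one matrix cell per ordered pair of dimensions (3 x 3 = 9 cells).
# Each cell holds a list that could collect prior work or open directions.
interaction_matrix = {
    (row, col): [] for row, col in product(Dimension, repeat=2)
}

example_workload = Workload(agentic=True, multi_turn=True, warm=False, prefill_heavy=True)
example_pool = Pool(heterogeneous_gpu=True, disaggregated_prefill_decode=True,
                    kv_cache_topology="unspecified")
```

Frozen dataclasses are used here so that a workload or pool description can serve as a dictionary key, for instance when tallying which combinations have been studied.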

Related papers