Fugu-MT Paper Translation (Abstract): M*: A Modular, Extensible, Serving System for Multimodal Models

Paper overview: M*: A Modular, Extensible, Serving System for Multimodal Models

arxiv url: http://arxiv.org/abs/2606.12688v2
Date: Sat, 13 Jun 2026 05:57:49 GMT
Status: Translation complete
Last updated in system: 2026-06-16 13:45:31.11842
Title: M*: A Modular, Extensible, Serving System for Multimodal Models
Title (reference translation): M*: A Modular, Extensible Serving System for Multimodal Models
Authors: Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang
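Abstract summary: This paper proposes M*, a universal serving system for efficient serving of composite AI models. It shows how the Walk Graph abstraction concisely captures composite models from a broad range of families. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x.
Reference score (independently computed attention score): 62.77975969000349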
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.
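Abstract (reference translation): We have entered a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure and are ill-suited to this new architectural diversity. This paper presents M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs and processes requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it concisely captures composite models from a broad range of families. Instantiating M* on representative models, we find that it achieves on average 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, and up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Our work thus paves the way toward more efficient serving of complex models with minimal developer effort.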
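
The abstract describes models as dataflow graphs and requests as traversals over those graphs. The following is a minimal toy sketch of that general idea only; it is not M*'s API or the paper's Walk Graph implementation. The class `ToyDataflowGraph`, the helper `make_stage`, and the node names are all hypothetical, and each "component" is just a string-wrapping function standing in for a real model stage.

```python
from typing import Callable, Dict, List, Set


class ToyDataflowGraph:
    """Hypothetical illustration: components as nodes, allowed dataflow as edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[[str], str]] = {}
        self.edges: Dict[str, Set[str]] = {}

    def add_node(self, name: str, fn: Callable[[str], str]) -> None:
        self.nodes[name] = fn
        self.edges.setdefault(name, set())

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError(f"unknown node in edge {src} -> {dst}")
        self.edges[src].add(dst)

    def walk(self, path: List[str], payload: str) -> str:
        """Run a request as a traversal: apply each node along a valid path."""
        for i, name in enumerate(path):
            if name not in self.nodes:
                raise ValueError(f"unknown node: {name}")
            if i > 0 and name not in self.edges[path[i - 1]]:
                raise ValueError(f"no edge {path[i - 1]} -> {name}")
            payload = self.nodes[name](payload)
        return payload


def make_stage(label: str) -> Callable[[str], str]:
    """Stand-in for a model component: wraps its input with the stage label."""
    def stage(x: str) -> str:
        return f"{label}({x})"
    return stage


# Assumed example topology: one shared backbone feeding two output heads.
graph = ToyDataflowGraph()
for name in ("text_encoder", "backbone", "image_head", "speech_head"):
    graph.add_node(name, make_stage(name))
graph.add_edge("text_encoder", "backbone")
graph.add_edge("backbone", "image_head")
graph.add_edge("backbone", "speech_head")

# Two different tasks are two different walks over the same graph.
text_to_image = graph.walk(["text_encoder", "backbone", "image_head"], "a cat")
text_to_speech = graph.walk(["text_encoder", "backbone", "speech_head"], "hello")
print(text_to_image)   # image_head(backbone(text_encoder(a cat)))
print(text_to_speech)  # speech_head(backbone(text_encoder(hello)))
```

In this sketch the two tasks share the encoder and backbone nodes and differ only in the final head, which is one way to read the abstract's claim that requests across modalities and tasks are traversals over a single graph. Placement onto a cluster and runtime optimizations, which the abstract also mentions, are not modeled here.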

Related papers