Fugu-MT paper translation (abstract): StreamWise: Serving Multi-Modal Generation in Real-Time at Scale

Paper overview: StreamWise: Serving Multi-Modal Generation in Real-Time at Scale

arxiv url: http://arxiv.org/abs/2603.05800v1
Date: Fri, 06 Mar 2026 01:22:16 GMT
Status: translation complete
Last updated in system: 2026-03-09 13:17:44.87647
Title: StreamWise: Serving Multi-Modal Generation in Real-Time at Scale
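Title (reference translation): StreamWise: Achieving Multi-Modal Generation in Real Time at Scale
Authors: Haoran Qiu, Gohar Irfan Chaudhry, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Rodrigo Fonseca, Ricardo Bianchini
Abstract summary: Multi-modal generative models enable new applications, from storytelling to automated media synthesis. Serving real-time multi-modal workflows at scale is currently costly and complex, requiring efficient coordination of diverse models. The authors design StreamWise, an adaptive, modular serving system that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling.
Reference score (attention score computed independently by the site): 7.73695790907204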
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, often requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models (each with unique resource needs) across language, audio, image, and video, all under strict latency and resource constraints. We tackle these challenges through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet tight SLOs, we design an adaptive, modular serving system, StreamWise, that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than real time) for less than $25. StreamWise enables high-quality real-time streaming with a sub-second startup delay under $45.
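Abstract (reference translation): Advances in multi-modal generative models are enabling new applications, from storytelling to automated media synthesis. Most current workloads generate simple outputs (e.g., image generation from a prompt) in batch mode, requiring several seconds even for basic results. Serving real-time multi-modal workflows at scale is costly and complex, requiring efficient coordination of diverse models across language, audio, image, and video. These challenges are addressed through the lens of real-time podcast video generation, integrating LLMs, text-to-speech, and video-audio generation. To meet strict SLOs, we design StreamWise, an adaptive, modular serving system that dynamically manages quality (e.g., resolution, sharpness), model/content parallelism, and resource-aware scheduling. We leverage heterogeneous hardware to maximize responsiveness and efficiency. For example, the system can lower video resolution and allocate more resources to early scenes. We quantify the trade-offs between latency, cost, and quality. The cheapest setup generates a 10-minute podcast video on A100 GPUs in 1.4 hours (8.4x slower than real time) for less than $25. StreamWise enables high-quality real-time streaming with a sub-second startup delay for under $45.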
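
The slowdown figure quoted in the abstract can be checked with simple arithmetic: 1.4 hours of generation for 10 minutes of content. The snippet below is only an illustrative sketch; the function name `realtime_slowdown` and its parameters are hypothetical and do not come from the paper or from StreamWise.

```python
def realtime_slowdown(generation_hours: float, content_minutes: float) -> float:
    """Return how many times longer generation takes than the content's own duration.

    Hypothetical helper for illustration only; not part of StreamWise.
    """
    generation_minutes = generation_hours * 60.0
    return generation_minutes / content_minutes


# Figures taken from the abstract: 10-minute video generated in 1.4 hours.
slowdown = realtime_slowdown(generation_hours=1.4, content_minutes=10.0)
print(f"{slowdown:.1f}x slower than real time")
```

A ratio of 1.0 would mean generation keeps pace with playback, which is the threshold a real-time streaming setup has to reach.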

Related papers