Fugu-MT Paper Translation (Abstract): Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing

Paper summary: Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing

arxiv url: http://arxiv.org/abs/2512.17574v1
Date: Fri, 19 Dec 2025 13:40:13 GMT
Status: Translation complete
Last updated in system: 2025-12-22 19:25:54.406449
Title: Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
Title (reference translation): Enabling disaggregated MLLM inference through GPU-internal scheduling and resource sharing
Authors: Lingxiao Zhao, Haoran Zhou, Yuezhi Che, Dazhao Cheng
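Abstract summary: Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline. Multimodal preprocessing, especially video decoding, dominates Time-to-First-Token (TTFT). We present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline.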
Reference score (independently computed attention score): 16.063514680699576
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline: multimodal preprocessing, vision encoding, and LLM inference. While these stages enhance capability, they introduce significant system bottlenecks. First, multimodal preprocessing, especially video decoding, often dominates Time-to-First-Token (TTFT). Most systems rely on CPU-based decoding, which severely limits throughput, while existing GPU-based approaches prioritize throughput-oriented parallelism and fail to meet the latency-sensitive requirements of MLLM inference. Second, the vision encoder is a standalone, compute-intensive stage that produces visual embeddings and cannot be co-batched with LLM prefill or decoding. This heterogeneity forces inter-stage blocking and increases token-generation latency. Even when deployed on separate GPUs, these stages underutilize available compute and memory resources, reducing overall utilization and constraining system throughput. To address these challenges, we present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline. FlashCodec accelerates the multimodal preprocessing stage through collaborative multi-GPU video decoding, reducing decoding latency while preserving high throughput. UnifiedServe optimizes the vision-to-text and inference stages by logically decoupling their execution to eliminate inter-stage blocking while physically sharing GPU resources to maximize GPU system utilization. UnifiedServe carefully orchestrates execution across stages and minimizes interference. Together, our proposed framework forms an end-to-end optimized stack that can serve up to 3.0$\times$ more requests or enforce 1.5$\times$ tighter SLOs, while achieving up to 4.4$\times$ higher throughput compared to state-of-the-art systems.
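Abstract (reference translation): Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline of multimodal preprocessing, vision encoding, and LLM inference. These stages increase capability but introduce significant system bottlenecks. First, multimodal preprocessing, especially video decoding, dominates Time-to-First-Token (TTFT). Most systems rely on CPU-based decoding, which severely limits throughput, while GPU-based approaches prioritize throughput-oriented parallelism and cannot meet the latency-sensitive requirements of MLLM inference. Second, the vision encoder is a standalone, compute-intensive stage that produces visual embeddings and cannot be co-batched with LLM prefill or decoding. This heterogeneity forces inter-stage blocking and increases token-generation latency. Even when deployed on separate GPUs, these stages underutilize the available compute and memory resources, which lowers overall utilization and constrains system throughput. To address these challenges, we present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline. FlashCodec accelerates the multimodal preprocessing stage through collaborative multi-GPU video decoding, reducing decoding latency while maintaining high throughput. UnifiedServe optimizes the vision-to-text and inference stages by logically decoupling their execution to eliminate inter-stage blocking while physically sharing GPU resources to maximize GPU system utilization. UnifiedServe carefully orchestrates execution across stages and minimizes interference. Together, the proposed framework forms an end-to-end optimized stack that can serve up to 3.0$\times$ more requests or enforce 1.5$\times$ tighter SLOs, while achieving up to 4.4$\times$ higher throughput than state-of-the-art systems.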
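
The abstract's claim that preprocessing can dominate TTFT can be illustrated with a small sketch. It assumes, as a simplification, that the three stages run back to back for a single request, so that TTFT is the sum of the three stage latencies. The class `StageLatency`, the helper functions, and every number below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    """Per-request stage latencies in milliseconds (hypothetical values)."""
    preprocess_ms: float  # multimodal preprocessing, e.g. video decoding
    encode_ms: float      # vision encoder producing visual embeddings
    prefill_ms: float     # LLM prefill up to the first generated token


def ttft_ms(s: StageLatency) -> float:
    # Assumption: stages execute serially for one request, so TTFT is their sum.
    return s.preprocess_ms + s.encode_ms + s.prefill_ms


def dominant_stage(s: StageLatency) -> str:
    # Return the name of the stage that contributes the most latency.
    stages = {
        "preprocess": s.preprocess_ms,
        "encode": s.encode_ms,
        "prefill": s.prefill_ms,
    }
    return max(stages, key=stages.get)


# Illustrative request in which video decoding is the slowest stage.
request = StageLatency(preprocess_ms=800.0, encode_ms=150.0, prefill_ms=50.0)
total = ttft_ms(request)
share = request.preprocess_ms / total
print(f"TTFT = {total:.0f} ms, dominant stage = {dominant_stage(request)}, "
      f"preprocessing share = {share:.0%}")
```

Under these made-up numbers, shortening the preprocessing stage has the largest effect on TTFT, which is the motivation the abstract gives for accelerating video decoding first.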

Related papers