Fugu-MT Paper Translation (Summary): Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

Paper summary: Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

arxiv url: http://arxiv.org/abs/2603.26498v1
Date: Fri, 27 Mar 2026 15:00:53 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.560328
Title: Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference
Title (reference translation): Rocks, Pebbles, Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference
Authors: Konstantinos Papaioannou, Thaleia Dimitra Doudali
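Abstract summary: Multimodal Large Language Models (MLLMs) enable rich interactions with text, images, and videos on platforms such as ChatGPT, Gemini, and Copilot. In existing LLM serving systems, large requests monopolize resources, causing head-of-line blocking and performance degradation. RPS-Serve is a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation.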
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. Our key insight is that multimodal requests differ by orders of magnitude in resource demands, which we capture through a simple abstraction: videos behave like rocks, images like pebbles, and text like sand. We design RPS-Serve, a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation. RPS-Serve classifies requests, prioritizes them dynamically, and applies aging to avoid starvation. Evaluation across state-of-the-art MLLMs shows that RPS-Serve reduces, on average, time-to-first-token (TTFT) by 54% overall, and by 78.5% for latency-critical requests, compared to current systems. RPS-Serve delivers LLM-like responsiveness for MLLMs, with modality-aware scheduling and by making the most efficient use of the available resources.
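Abstract (reference translation): Multimodal Large Language Models (MLLMs) enable rich interactions with text, images, and videos on platforms such as ChatGPT, Gemini, and Copilot. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, which increase latency and memory demand. Existing LLM serving systems are optimized for text-only workloads and fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. The key insight is that multimodal requests differ by orders of magnitude in resource demands, which is captured by a simple abstraction: videos behave like rocks, images like pebbles, and text like sand. RPS-Serve is a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation. RPS-Serve classifies requests, prioritizes them dynamically, and applies aging to avoid starvation. In evaluations across state-of-the-art MLLMs, RPS-Serve reduces time-to-first-token (TTFT) by 54% on average overall, and by 78.5% for latency-critical requests, compared to current systems. RPS-Serve delivers LLM-like responsiveness for MLLMs through modality-aware scheduling and efficient use of the available resources.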
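
The abstract describes the scheduler only at a high level (classify requests, prioritize dynamically, apply aging). The following is a minimal illustrative sketch of that general idea in Python, not the authors' implementation. The base priority values, the `AGING_RATE` constant, and the names `Request`, `effective_priority`, and `pick_next` are all assumptions introduced here for illustration; the abstract does not specify how RPS-Serve classifies requests or computes priorities.

```python
from dataclasses import dataclass

# Hypothetical base priorities (lower value is served first).
# The "sand / pebbles / rocks" mapping follows the abstract; the numbers do not.
BASE_PRIORITY = {"text": 0.0, "image": 1.0, "video": 2.0}

# Assumed aging rate: priority credit earned per second of waiting.
AGING_RATE = 0.5


@dataclass
class Request:
    request_id: str
    modality: str        # "text" (sand), "image" (pebble), or "video" (rock)
    arrival_time: float  # seconds


def effective_priority(req: Request, now: float) -> float:
    """Base priority of the modality, lowered the longer the request has waited."""
    waited = now - req.arrival_time
    return BASE_PRIORITY[req.modality] - AGING_RATE * waited


def pick_next(queue: list, now: float) -> Request:
    """Remove and return the waiting request with the lowest effective priority.

    Ties are broken by arrival time (earlier first).
    """
    best = min(queue, key=lambda r: (effective_priority(r, now), r.arrival_time))
    queue.remove(best)
    return best
```

With these assumed constants, a text request that arrives shortly after a video request is served first, while a video request that has waited long enough overtakes newly arrived text, which is the starvation-avoidance effect that aging is meant to provide.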

Related papers