Fugu-MT Paper Translation (Summary): Training-Free Multimodal Large Language Model Orchestration

Paper summary: Training-Free Multimodal Large Language Model Orchestration

arxiv url: http://arxiv.org/abs/2508.10016v1
Date: Wed, 06 Aug 2025 16:17:29 GMT
Status: Translation complete
Last updated in system: 2025-08-15 22:24:48.009673
Title: Training-Free Multimodal Large Language Model Orchestration
Title (reference translation): Training-Free Multimodal Large Language Model Orchestration
Authors: Tianyu Xie, Yuhang Wu, Yongdong Luo, Jiayi Ji, Xiawu Zheng
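Abstract summary: This paper reports an effective approach for building interactive multimodal AI systems. The framework is built on three key innovations: (1) a central controller that analyzes user inputs, (2) a parallel Text-to-Speech architecture, and (3) cross-modal memory integration.
Reference score (attention score computed independently by this site): 16.211979950149928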
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Different Multimodal Large Language Models (MLLMs) cannot be integrated into a unified multimodal input-output system directly. In previous work, training has been considered as an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduced latency by 10.3%, and significantly enhanced interpretability through explicit orchestration processes.
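Abstract (reference translation): Different Multimodal Large Language Models (MLLMs) cannot be directly integrated into a unified multimodal input-output system. Previous work has treated training as an unavoidable component because of modal alignment, Text-to-Speech efficiency, and other integration issues. This paper proposes Multimodal Large Language Model Orchestration, an effective method for building interactive multimodal AI systems. MLLM Orchestration uses the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interaction while preserving modularity, improving interpretability, and substantially improving computational efficiency. The orchestration framework is built on three key innovations: (1) a central control mechanism that analyzes user inputs and dynamically routes tasks to the appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that achieves true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, and selectively avoids unnecessary modality calls to improve response speed. In extensive evaluations, MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, with performance gains of up to 7.8% over conventional jointly-trained approaches on standard benchmarks, a 10.3% reduction in latency, and improved interpretability through explicit orchestration processes.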
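
The routing and memory ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Orchestrator` class, the `build_demo` helper, the specialist names, and the keyword-based `route` method are all hypothetical stand-ins. In the paper the routing decision is made by a controller LLM, and the parallel Text-to-Speech component is not modeled here at all.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Orchestrator:
    """Toy controller: routes a request to one specialist and caches answers.

    Hypothetical illustration only; not the API of the paper's system.
    """
    specialists: Dict[str, Callable[[str], str]]
    # Stand-in for cross-modal memory: maps a request to its earlier answer.
    memory: Dict[str, str] = field(default_factory=dict)
    # Log of specialist invocations, used to show when a call was skipped.
    calls: List[str] = field(default_factory=list)

    def route(self, user_input: str) -> str:
        # Assumption: keyword matching replaces the controller LLM's decision.
        text = user_input.lower()
        if "image" in text or "picture" in text:
            return "vision"
        if "audio" in text or "sound" in text:
            return "audio"
        return "text"

    def handle(self, user_input: str) -> str:
        # Reuse a remembered answer instead of calling a specialist again.
        if user_input in self.memory:
            return self.memory[user_input]
        name = self.route(user_input)
        self.calls.append(name)
        answer = self.specialists[name](user_input)
        self.memory[user_input] = answer
        return answer


def build_demo() -> Orchestrator:
    # Each lambda stands in for a specialized model (assumed, not real models).
    return Orchestrator(specialists={
        "vision": lambda q: f"[vision] {q}",
        "audio": lambda q: f"[audio] {q}",
        "text": lambda q: f"[text] {q}",
    })
```

Calling `handle` twice with the same request invokes the specialist only once, which loosely mirrors the abstract's point about avoiding unnecessary modality calls. A real system would presumably need a far richer memory lookup than exact string matching.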

Related papers